E. coli Toxin-Antitoxin Genome Discovery

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Bacteriology · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Bacteria carry tiny survival modules that can help them shut down, pause, or fight stress. Some of these modules hide in plain sight inside genome databases. You can search for them without touching a petri dish. That means your project can start with public data and still ask a real discovery question.

What Is It?

A toxin-antitoxin system is a pair of neighboring genes in bacteria. One gene encodes a toxin that can harm or slow the cell, and the other encodes an antitoxin that blocks that effect. Think of it like a spring-loaded trap with a safety latch: the toxin is the trap, the antitoxin is the latch, and the trap can fire if conditions change and the latch is lost.

Your project asks whether a poorly annotated protein family hides new toxin-antitoxin pairs in E. coli. You would scan thousands of genomes for matches to a profile model of that protein family, then check whether each matching gene sits next to a partner gene in a pattern that fits a toxin-antitoxin system. Conservation of that nearby-gene pattern across genomes is called synteny, which here means shared gene order and neighborhood.
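
To make the partner-gene check concrete, here is a minimal Python sketch. It assumes you have already pulled gene coordinates out of an annotation file. The `Gene` class, the `find_partner_candidates` function, and the cutoffs (`max_gap`, `max_partner_aa`) are all hypothetical choices for illustration, not established rules for toxin-antitoxin systems.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str
    contig: str
    start: int   # 1-based start coordinate
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"

    @property
    def length_aa(self) -> int:
        # Approximate protein length from the coding span, minus the stop codon.
        return (self.end - self.start + 1) // 3 - 1


def find_partner_candidates(hit, genes, max_gap=50, max_partner_aa=150):
    """Return immediate neighbors of `hit` that fit a simple two-gene pattern.

    Assumed pattern (illustrative only): same contig, same strand,
    a short or negative gap, and a small partner protein.
    `hit` must be one of the entries in `genes`.
    """
    same_contig = sorted(
        (g for g in genes if g.contig == hit.contig), key=lambda g: g.start
    )
    idx = next(i for i, g in enumerate(same_contig) if g.gene_id == hit.gene_id)
    partners = []
    for j in (idx - 1, idx + 1):
        if j < 0 or j >= len(same_contig):
            continue  # the hit sits at the edge of the contig
        neighbor = same_contig[j]
        # The gap is negative when the two genes overlap by a few bases.
        if j < idx:
            gap = hit.start - neighbor.end - 1
        else:
            gap = neighbor.start - hit.end - 1
        if (
            neighbor.strand == hit.strand
            and gap <= max_gap
            and neighbor.length_aa <= max_partner_aa
        ):
            partners.append(neighbor)
    return partners


# Made-up coordinates for illustration.
genes = [
    Gene("geneA", "contig1", 100, 340, "+"),
    Gene("geneB", "contig1", 337, 700, "+"),
    Gene("geneC", "contig1", 1500, 1800, "-"),
]
partners = find_partner_candidates(genes[1], genes)
```

In this toy example only `geneA` passes, because `geneC` is far away and on the opposite strand. A real project would tune the cutoffs against known toxin-antitoxin loci before trusting them.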

Then you would use predicted protein structures to guess how the candidate works. Structure can hint at whether a protein looks like a toxin, an antitoxin, or something else entirely. The final result is not a wet-lab answer. It is a data-driven prediction about a possible new biological system.

Why This Is a Good Topic

This is a strong science fair topic because you can ask a real discovery question with public data, clear filters, and measurable outcomes. It connects to antibiotic tolerance, stress response, and bacterial survival, which are real problems in medicine and microbiology. You can learn genome mining, gene neighborhood analysis, and structure prediction, all of which are useful research skills. The project also has room for original insight if you compare candidate clusters, not just count hits.

Research Questions

How does the distribution of Pfam hits vary across E. coli RefSeq genomes?
How much does gene neighborhood conservation increase confidence that a hit belongs to a toxin-antitoxin system?
Does the predicted protein structure of candidate hits resemble known toxin folds more than known antitoxin folds?
To what extent do candidate loci cluster into distinct synteny groups across E. coli strains?
Which genomic features best predict whether a candidate hit is likely to be functional rather than a false positive?
How does the presence of nearby mobile-element genes affect the probability that a candidate locus is a toxin-antitoxin system?
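
One way to approach the mobile-element question is a 2x2 count table and a one-sided Fisher's exact test. The sketch below uses only the Python standard library. The function name `fisher_one_sided` and the example counts are invented for illustration, and the test assumes the loci are independent, which closely related genomes can violate.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    Table layout (counts of loci):
                         mobile gene nearby   no mobile gene nearby
        candidate loci           a                      b
        other loci               c                      d

    Returns the probability of seeing `a` or more candidate loci near
    mobile-element genes if there were no association.
    """
    row1 = a + b          # all candidate loci
    col1 = a + c          # all loci near a mobile-element gene
    total = a + b + c + d
    k_max = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, k_max + 1)
    )
    return numerator / comb(total, row1)


# Invented counts, not real data.
p_value = fisher_one_sided(3, 1, 1, 3)
```

A small p-value would suggest candidate loci sit near mobile-element genes more often than other loci do, but it would not prove the locus is a toxin-antitoxin system.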

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Reliable internet access for downloading RefSeq genome metadata and protein sequences.
Python installed with bioinformatics packages for parsing, filtering, and plotting.
HMMER command-line tools for profile searches.
Clinker for gene neighborhood comparison.
ColabFold access through a Google account for structure prediction.
Spreadsheet software for tracking candidate loci and scoring rules.
NCBI RefSeq genome access and genome annotation files.
Sequence viewer or text editor for inspecting gene calls and headers.

Advanced Materials

Workstation or access to a computing cluster for large batch searches.
Local copy of RefSeq genome proteins and annotations.
HMMER profile files for the target Pfam family.
Custom Python scripts for deduplication, hit ranking, and synteny scoring.
Clinker for comparative gene-cluster visualization.
ColabFold or local AlphaFold-style environment for structure prediction.
Structural visualization software such as PyMOL or ChimeraX.
Statistical analysis environment such as R or Python with SciPy and pandas.
Public databases such as UniProt, Pfam, NCBI Gene, and PDB for annotation cross-checks.

Software & Tools

HMMER: Searches genome protein sets for matches to a profile hidden Markov model.
Clinker: Compares gene neighborhoods so you can group candidate loci by synteny.
ColabFold: Predicts protein structures from amino acid sequences to support mechanism guesses.
Python: Organizes genome hits, filters false positives, and makes plots.
ImageJ: Optional. This project does not involve image analysis, so use it only if you need to inspect exported figures.
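
To show how HMMER and Python fit together, here is a rough sketch that reads the table written by `hmmsearch --tblout` and keeps hits under an E-value cutoff. The column positions follow HMMER's per-target table format. The function name `parse_tblout`, the `1e-5` cutoff, and the example rows are assumptions for illustration, not recommended settings or real results.

```python
def parse_tblout(lines, max_evalue=1e-5):
    """Parse hmmsearch --tblout lines and keep hits at or below max_evalue.

    max_evalue is an illustrative cutoff, not a recommended value.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header and footer
        # The first 18 columns are whitespace-separated.
        # Everything after them is a free-text description.
        fields = line.split(maxsplit=18)
        hit = {
            "target": fields[0],         # protein ID from the genome
            "query": fields[2],          # name of the profile
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        }
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Made-up rows in tblout column order (not real search results).
example = [
    "# target name accession query name accession E-value score bias ...",
    "prot_strong - ExampleFam - 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
    "prot_weak   - ExampleFam - 0.5      12.1 0.0 0.7      11.8 0.0 1.0 1 0 0 1 1 1 0 hypothetical protein",
]
kept = parse_tblout(example)
```

With a real file you would pass the open file handle instead of the `example` list. Only `prot_strong` survives the cutoff here.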

Experiment Steps

Define the biological question and decide what counts as a candidate toxin-antitoxin locus.
Build a genome list and choose one reference Pfam profile or seed set for your search.
Set filters for hit quality, gene neighborhood rules, and duplicate genome handling.
Group candidate hits by synteny so you can separate repeated families from one-off false positives.
Predict structures for the strongest candidates and compare folds against known toxin and antitoxin classes.
Plan a scoring scheme that combines sequence, neighborhood, and structure evidence into one ranking.
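
The synteny grouping step can be prototyped in plain Python before you move to Clinker. The sketch below groups loci by how many neighboring gene families they share, using Jaccard similarity and a greedy pass. This is a cruder stand-in than the protein alignment that a tool like Clinker performs. The function names, the family labels, and the `0.5` threshold are hypothetical.

```python
def jaccard(a, b):
    """Shared labels divided by all labels across the two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_neighborhood(loci, min_similarity=0.5):
    """Greedy grouping of loci by neighbor-family overlap.

    loci: dict mapping locus ID -> labels of the gene families nearby.
    Each locus joins the first group whose founding locus is similar
    enough. Otherwise it starts a new group.
    """
    groups = []  # list of (founder_labels, [locus IDs])
    for locus_id, labels in loci.items():
        for founder_labels, members in groups:
            if jaccard(founder_labels, labels) >= min_similarity:
                members.append(locus_id)
                break
        else:
            groups.append((set(labels), [locus_id]))
    return [members for _, members in groups]


# Made-up neighborhoods for illustration.
loci = {
    "locus1": {"famA", "famB", "famC"},
    "locus2": {"famA", "famB", "famD"},
    "locus3": {"famX", "famY"},
}
synteny_groups = group_by_neighborhood(loci)
```

Greedy grouping depends on input order, so treat the result as a first look. A group that keeps appearing across many strains is a better lead than a locus that stands alone.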

Common Pitfalls

Treating every HMMER hit as a true member of the protein family, which floods the analysis with weak false positives.
Ignoring genome redundancy, which makes one strain lineage look like many independent discoveries.
Comparing candidates without checking gene order, which breaks the logic of a toxin-antitoxin pair.
Trusting one structure prediction without comparing it to known folds, which can overstate the mechanism claim.
Mixing annotation versions across genomes, which changes gene boundaries and makes synteny plots hard to interpret.
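
The redundancy pitfall is easier to catch if you count lineages as well as genomes. The sketch below assumes you have already assigned each genome a lineage label, for example from a typing scheme, and each candidate locus a synteny group. The function name `summarize_support` and the record layout are hypothetical.

```python
from collections import defaultdict


def summarize_support(records):
    """Count distinct genomes and distinct lineages for each synteny group.

    records: iterable of (genome_id, lineage, group) tuples.
    """
    genomes = defaultdict(set)
    lineages = defaultdict(set)
    for genome_id, lineage, group in records:
        genomes[group].add(genome_id)
        lineages[group].add(lineage)
    return {
        group: {"genomes": len(genomes[group]), "lineages": len(lineages[group])}
        for group in genomes
    }


# Made-up records for illustration.
records = [
    ("genome1", "lineage_1", "groupA"),
    ("genome2", "lineage_1", "groupA"),
    ("genome3", "lineage_1", "groupA"),
    ("genome4", "lineage_2", "groupB"),
    ("genome5", "lineage_3", "groupB"),
]
support = summarize_support(records)
```

In this toy data, `groupA` appears in three genomes but only one lineage, so it is weaker evidence of independent conservation than `groupB`, which appears in two genomes from two lineages.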

What Makes This Competitive

A stronger project would not stop at finding hits. It would build a repeatable scoring system that weighs sequence quality, gene neighborhood, and structure together. You could also compare your candidates against known toxin-antitoxin families to see whether they form a new class or a known one in disguise. Careful false-positive control and clear visual summaries would make the story much stronger.
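
A repeatable scoring system can start as a simple weighted average. The sketch below assumes each line of evidence has already been scaled to a value between 0 and 1. The weights, the function names, and the candidate values are placeholders that you would need to choose and justify yourself.

```python
# Placeholder weights. Choosing and defending these is part of the project.
WEIGHTS = {"sequence": 0.4, "neighborhood": 0.4, "structure": 0.2}


def combined_score(evidence, weights=WEIGHTS):
    """Weighted average of evidence values, each expected in [0, 1]."""
    for name in weights:
        value = evidence[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    weighted = sum(weights[name] * evidence[name] for name in weights)
    return weighted / sum(weights.values())


def rank_candidates(candidates, weights=WEIGHTS):
    """Return (candidate ID, score) pairs sorted from highest to lowest score.

    candidates: dict mapping candidate ID -> evidence dict.
    """
    scored = [(cid, combined_score(ev, weights)) for cid, ev in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up evidence values for illustration.
candidates = {
    "cand1": {"sequence": 1.0, "neighborhood": 0.5, "structure": 0.0},
    "cand2": {"sequence": 0.6, "neighborhood": 1.0, "structure": 1.0},
}
ranking = rank_candidates(candidates)
```

Writing the rule down as code means a judge can rerun it. You can also test how much the ranking changes when you shift the weights, which is one way to show the result does not depend on a single arbitrary choice.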

Project Variations

Focus on one E. coli pathotype group instead of all RefSeq genomes to test whether candidate loci are enriched in a disease-associated lineage.
Compare the Pfam hit patterns in E. coli with those in Salmonella or Shigella to see whether the synteny signature is conserved across related bacteria.
Add a protein-disorder or domain-architecture analysis to test whether candidate antitoxins look more flexible than candidate toxins.

Learn More

NCBI RefSeq: Search genome records and annotation files for E. coli and related bacteria on the NCBI genome and assembly pages.
Pfam: Read family descriptions and seed alignments for the poorly annotated protein family in Pfam, which is now hosted on the InterPro website.
HMMER documentation: Learn how profile searches work from the HMMER user guide and command documentation.
Clinker paper and documentation: Find gene-cluster comparison examples in the peer-reviewed article and project docs.
ColabFold paper: Read the method paper for structure prediction and find the ColabFold notebook in the authors' public materials.
UniProt and PDB: Check known protein functions and structures through the UniProt database and the Protein Data Bank.

Microbiology Category Guide

How to Do Real Microbiology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
