Phage Host Prediction With Tail Fiber CNNs

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Virology  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A single amino acid change can help a phage infect a new host. That tiny swap is like changing one tooth on a key, then seeing whether the lock still turns. You can test that idea with sequence data, machine learning, and protein structure models. This project sits right where biology and code meet.

What Is It?

Bacteriophages, or phages, are viruses that infect bacteria. Many phages use tail fibers or receptor-binding proteins to grab onto a host cell. Those proteins act a bit like a grappling hook. If the hook shape changes, the phage may bind a different bacterial genus or fail to bind at all.

Your project asks two linked questions. First, can a convolutional neural network, or CNN, learn patterns in tail-fiber sequences that predict host genus from INPHARED data? Second, if the model flags sequence changes tied to host shifts, do those changes cluster in loop regions on an AlphaFold protein structure? Loops often sit on the outside of a protein and contact the host, so they make a good target for testing structure-function ideas.
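Before a CNN can read a tail-fiber sequence, the letters have to become numbers. A minimal sketch of one common choice, one-hot encoding, is shown here. The function name, the padding rule, and the toy sequence are assumptions made for illustration and are not part of INPHARED or any specific library.

```python
# Sketch: turn a protein sequence into a fixed-size one-hot matrix.
# All names here are hypothetical helpers, not a library API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix as a list of lists (rows = positions).

    Sequences longer than max_len are truncated. Shorter sequences are
    padded with all-zero rows. Non-standard letters (such as X) also
    become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


# Toy example: a made-up 6-residue fragment padded to length 8.
encoded = one_hot_encode("MKTAXW", max_len=8)
```

Each row holds at most a single 1, so a convolution filter sliding along the rows sees short residue patterns. A deep learning framework would expect this matrix as a tensor, which is a separate conversion step.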

Why This Is a Good Topic

This is a strong science fair topic because it gives you a clear prediction problem, a clear biological question, and a clean way to compare model output with protein structure. You can test whether sequence patterns carry host information, then check whether the most informative changes land in specific structural regions. The topic also connects to phage therapy, bacterial diagnostics, and host range prediction. A student can learn data cleaning, model training, feature interpretation, and structure mapping in one project.

Research Questions

  • How does a CNN’s host-genus prediction accuracy change when you train it on tail-fiber sequences from different phage families?
  • What is the effect of removing very similar sequences from the training set on host-genus prediction accuracy?
  • Does a CNN identify sequence positions that overlap with AlphaFold-modeled receptor-binding loops more often than expected by chance?
  • To what extent do single substitutions that the model predicts will switch hosts cluster in exposed loop regions rather than buried residues?
  • Which phage families produce the most accurate host-genus predictions from tail-fiber sequence alone?
  • How does adding simple sequence features, such as amino acid composition, change performance compared with a sequence-only CNN?
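The amino acid composition feature mentioned in the last question can be computed in a few lines. This is a minimal sketch under the assumption that composition means the fraction of each of the 20 standard residues; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return 20 fractions, one per standard amino acid, in AMINO_ACIDS order.

    Non-standard letters (such as X) are ignored when counting.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy example: 2 A, 2 G, 1 K, and one ignored X.
features = aa_composition("AAGGKX")
```

The same 20-number vector can feed a simple baseline classifier or be appended to the CNN's learned features, which lets you compare the two setups on identical data splits.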

Basic Materials

  • Computer with enough memory to run Python and small machine learning models.
  • Python with common data science packages.
  • INPHARED phage sequence dataset or a similar public phage-host database.
  • FASTA sequence files for tail-fiber or receptor-binding proteins.
  • Spreadsheet software for tracking samples and labels.
  • Basic sequence alignment tool or command-line FASTA utilities.
  • Protein structure viewer, such as UCSF ChimeraX (free for academic use) or PyMOL (open-source build or educational license).
  • Access to a public AlphaFold structure database or predicted structure files.

Advanced Materials

  • University or research lab access to a workstation with a GPU.
  • Curated phage-host sequence set with taxonomic labels and metadata.
  • Protein family annotation tools for identifying tail-fiber and receptor-binding regions.
  • Multiple sequence alignment software for conserved-site analysis.
  • Structural analysis software for solvent accessibility and loop annotation.
  • Scripts for in silico mutagenesis and saliency mapping.
  • Statistical software for permutation tests and effect size calculations.
  • Access to literature databases for validating host-range annotations.

Software & Tools

  • Python: Runs your data cleaning, model training, and evaluation code.
  • PyTorch or TensorFlow: Builds and trains the CNN on sequence data.
  • Biopython: Parses FASTA files and handles basic sequence operations.
  • ImageJ: Optional. This project does not analyze images, so most students will not need it. It helps only if you later compare structural snapshots visually.
  • UCSF ChimeraX: Visualizes AlphaFold structures and helps you map predicted mutation sites onto loops.
  • AlphaFold Database: Provides predicted protein structures you can inspect without running a full structure prediction pipeline.

Experiment Steps

  1. Define the prediction task by choosing one host label, one protein region, and one filtering rule for close duplicates.
  2. Build a clean sequence-label table and decide how you will split training, validation, and test sets without data leakage.
  3. Choose a simple baseline model and a CNN model so you can show whether the CNN adds value over the simpler approach.
  4. Plan a way to score model output, then decide how you will translate important sequence positions into biological sites.
  5. Map high-signal positions onto AlphaFold structures and decide how you will count loop, surface, and buried residues.
  6. Design a permutation or randomization test that checks whether host-switch substitutions cluster more than expected by chance.
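One way to handle the leakage concern in step 2 is to split by sequence cluster instead of by individual sequence. The sketch below assumes you already have one cluster label per sequence, produced by whatever similarity-clustering tool you choose. The function name, the cluster labels, and the split fraction are all hypothetical.

```python
import random


def split_by_cluster(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to either train or test.

    cluster_ids: list with one cluster label per sequence.
    test_fraction: fraction of CLUSTERS (not sequences) sent to the test set.
    Returns (train_indices, test_indices).
    """
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train_idx, test_idx


# Toy example: 8 sequences grouped into 5 similarity clusters.
cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c4", "c5", "c5"]
train_idx, test_idx = split_by_cluster(cluster_ids, test_fraction=0.4)
```

Because every member of a cluster lands on the same side, near-duplicate sequences cannot appear in both training and test data. A validation set can be carved out of the training indices the same way.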

Common Pitfalls

  • Mixing closely related phage sequences across train and test sets, which makes the model look better than it really is.
  • Using host labels that are too broad or too noisy, which hides any real sequence signal.
  • Training on full viral genomes instead of the right protein region, which dilutes the host-binding pattern.
  • Treating AlphaFold coordinates as proof of function, when they only give a predicted structure.
  • Ignoring class imbalance, which can make a model look accurate while it mostly guesses the most common host genus.
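The class imbalance pitfall is easy to see with a small worked example. The sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) for a "model" that always guesses the most common genus. The labels and counts are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every genus counts equally."""
    total = Counter(y_true)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels: 8 phages on one genus, 2 on another (made-up counts).
y_true = ["Escherichia"] * 8 + ["Klebsiella"] * 2

# A "model" that always predicts the most common genus.
majority = Counter(y_true).most_common(1)[0][0]
y_majority = [majority] * len(y_true)

print(accuracy(y_true, y_majority))           # 0.8
print(balanced_accuracy(y_true, y_majority))  # 0.5
```

The majority guesser scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is chance level for two classes. Reporting both numbers, along with the majority-class baseline, makes it clear how much the model actually learned.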

What Makes This Competitive

A class-level version of this project stops at model accuracy. A stronger version asks whether the model’s top signals map to a plausible biological mechanism. You can raise the level by testing several phage families, using strict sequence-similarity splits, and comparing CNN results with a simpler baseline. If you also run a permutation test on loop clustering, your project moves from prediction toward mechanism.
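A randomization test for loop clustering can be sketched as follows. It assumes a deliberately simple null model: the same number of flagged positions drawn uniformly at random, without replacement, from the whole protein. The function name, the 0-based position numbering, and the toy loop ranges are assumptions for illustration only, and a real analysis may need a null that accounts for regions the model or the alignment cannot reach.

```python
import random


def loop_enrichment_p_value(flagged_positions, loop_positions, protein_length,
                            n_permutations=10000, seed=0):
    """One-sided Monte Carlo p-value for enrichment of flagged sites in loops.

    Null model (an assumption): flagged positions are placed uniformly at
    random along the protein. Positions are 0-based.
    Returns (observed_loop_hits, p_value).
    """
    loop_set = set(loop_positions)
    flagged = set(flagged_positions)
    observed = len(flagged & loop_set)

    rng = random.Random(seed)
    all_positions = range(protein_length)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        sample = rng.sample(all_positions, len(flagged))
        if sum(pos in loop_set for pos in sample) >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy example: a 100-residue protein with two made-up loops.
loops = list(range(10, 20)) + list(range(50, 60))
flagged = [11, 12, 15, 52, 55, 80]  # hypothetical high-signal positions
observed, p_value = loop_enrichment_p_value(flagged, loops, protein_length=100)
```

In this toy case five of the six flagged positions fall in loops, although loops cover only 20 percent of the protein, so the p-value comes out small. With real data, report the observed count, the loop fraction, and the number of permutations next to the p-value.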

Project Variations

  • Train the same CNN on capsid or baseplate proteins instead of tail fibers to see which region best predicts host genus.
  • Switch the label from host genus to host species for a harder and more detailed classification task.
  • Compare AlphaFold loop clustering across temperate and lytic phages to see whether host-switch patterns differ by lifestyle.

Learn More

  • NCBI Virus: Search viral genomes and metadata, then filter for phages with host information.
  • PubMed: Search review articles on phage host range, receptor-binding proteins, and tail fibers.
  • INPHARED project papers: Read the dataset description and methods in the original phage-host database publications, which you can find by searching PubMed or Google Scholar.
  • NCBI Protein and Gene databases: Check annotations for phage structural proteins and taxonomic labels.
  • AlphaFold Protein Structure Database: Open predicted protein structures and inspect loop regions for candidate binding sites.
  • MIT OpenCourseWare, Machine Learning: Use lecture notes on basic model evaluation, overfitting, and validation if you need a refresher on the ML workflow.
