Spike Protein Evolution Forecasting
ISEF Category: Microbiology
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor.
Subcategory: Virology · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
Viruses change fast, and a tiny amino acid swap can help them spread better or dodge immunity. That makes them a moving target, like a lock whose shape keeps changing. You can study those changes with public sequence data, not a wet lab. The challenge is to predict which residues are most likely to change next.
What Is It?
This project asks a simple question with a hard answer: can you predict where a virus will mutate next? You are not growing virus. You are reading its history from sequence databases and phylogenetic trees, which are family trees built from genetic data. Nextstrain helps you see how lineages spread and split over time. ESM2 is a protein language model, a type of AI that scores how natural or likely a protein sequence looks to the model.
Think of a protein like a sentence. Some changes keep the sentence readable, and some make it awkward. In biology, a readable change may still let the protein fold and work, while a bad change may break it. Your job is to compare sequence history, model scores, and known lineage changes to see whether the next likely mutation sites show up before they become common.
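To make the sentence analogy concrete, here is a minimal sketch of one common way to turn a language model's output at a single position into a substitution score: the log-odds of the mutant residue against the original residue. The logits below are made-up numbers, not real ESM2 output, and the function names are hypothetical; in a real pipeline the logits would come from the model after masking the position of interest.

```python
import math

def softmax(logits):
    """Convert raw logits (dict: amino acid -> float) into probabilities."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {aa: math.exp(v - m) for aa, v in logits.items()}
    total = sum(exps.values())
    return {aa: e / total for aa, e in exps.items()}

def substitution_score(logits, wild_type, mutant):
    """Log-odds of the mutant residue versus the wild-type residue at one site."""
    probs = softmax(logits)
    return math.log(probs[mutant]) - math.log(probs[wild_type])

# Hypothetical logits at one masked position (illustrative values only).
toy_logits = {"N": 2.0, "Y": 1.0, "K": 0.0, "P": -3.0}

for mutant in ("Y", "K", "P"):
    score = substitution_score(toy_logits, "N", mutant)
    print(f"N->{mutant}: {score:.2f}")
```

A score near zero means the model finds the swap about as plausible as the original residue, while a strongly negative score marks an "awkward" change. Whether such scores track real viral evolution is exactly what the project would test.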
Why This Is a Good Topic
This is a strong science fair topic because it is testable with public data, and you can score your predictions against later variants. It connects to a real problem, which is how public health teams track viral evolution and prepare for new waves. You can learn sequence analysis, phylogenetics, model validation, and basic machine learning without needing access to a pathogen lab.
Research Questions
- How does the ESM2 grammaticality score at a site change before a substitution at that site becomes common in later SARS-CoV-2 lineages?
- What is the effect of using phylogenetic branch information on prediction accuracy for future spike mutations?
- Does a model trained on one influenza season generalize to antigenic drift in later seasons?
- To what extent do highly conserved residues resist change across SARS-CoV-2 clades and influenza subtypes?
- Which combination of Nextstrain lineage data and protein language model scores best predicts future spike substitutions?
- How does prediction accuracy vary between receptor-binding sites and more conserved regions of the protein?
Basic Materials
- Laptop or desktop computer with internet access.
- Free account access to public sequence databases and browser-based tools.
- Nextstrain website and downloadable clade data.
- NCBI Virus (open access) or GISAID metadata (requires a registered account and agreement to its data-use terms), if school access allows.
- ESM2 model access through a free notebook or local Python setup.
- Spreadsheet software for tracking residues and scores.
- Digital notebook for manual coding decisions and version tracking.
Advanced Materials
- Workstation with a GPU or access to a university compute cluster.
- Python environment with Biopython, pandas, NumPy, scikit-learn, and matplotlib.
- Local or cloud access to ESM2 model weights.
- Multiple sequence alignment software such as MAFFT.
- Phylogenetic analysis tools such as IQ-TREE or RAxML.
- Jupyter Notebook for reproducible analysis.
- Structural visualization software such as PyMOL or UCSF ChimeraX.
- Statistical testing tools for calibration and back-testing.
Software & Tools
- Python: Handles sequence parsing, scoring, plotting, and back-testing.
- Nextstrain: Shows viral phylogenies and lineage changes over time.
- Jupyter Notebook: Keeps your analysis readable, repeatable, and easy to explain.
- Biopython: Helps you load sequences, align residues, and map mutations.
- scikit-learn: Supports simple prediction models and accuracy checks.
Experiment Steps
- Define the exact protein region you will study, such as spike, hemagglutinin, or a binding subdomain.
- Collect a time-stamped sequence set and decide how you will split older lineages from newer ones for back-testing.
- Choose features that describe each residue, such as conservation, lineage frequency, and ESM2 grammaticality score.
- Build a prediction rule that ranks residues by how likely they are to change in later samples.
- Plan a back-test that compares your early predictions with mutations that appeared in later lineages.
- Decide how you will judge success, using rank correlation, precision at top sites, or another clear metric.
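The steps above can be sketched end to end on toy data. This is a minimal illustration, not a recommended pipeline: the sequences, the cutoff date, and the choice of Shannon entropy as the only feature are all assumptions made for the example. Positions are 0-based alignment columns.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical toy data: (collection date, aligned sequence).
records = [
    (date(2020, 3, 1), "MKNLA"),
    (date(2020, 5, 1), "MKNLA"),
    (date(2020, 8, 1), "MKYLA"),
    (date(2020, 11, 1), "MRNLA"),
    (date(2021, 4, 1), "MKYLA"),
    (date(2021, 6, 1), "MKYLA"),
    (date(2021, 9, 1), "MKYLT"),
]
CUTOFF = date(2021, 1, 1)  # assumed split between "early" and "late" samples

early = [seq for d, seq in records if d < CUTOFF]
late = [seq for d, seq in records if d >= CUTOFF]

def column(seqs, i):
    """All residues observed at alignment column i."""
    return [s[i] for s in seqs]

def entropy(residues):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def consensus(residues):
    """Most frequent residue in a column."""
    return Counter(residues).most_common(1)[0][0]

length = len(early[0])

# Score and rank positions using early data only, so no future data leaks in.
scores = {i: entropy(column(early, i)) for i in range(length)}
ranked = sorted(scores, key=lambda i: (-scores[i], i))

# Ground truth: columns whose consensus residue differs in the later samples.
changed = {
    i for i in range(length)
    if consensus(column(late, i)) != consensus(column(early, i))
}

def precision_at_k(ranked, truth, k):
    """Fraction of the top-k ranked positions that actually changed."""
    return sum(1 for i in ranked[:k] if i in truth) / k

print("ranked positions:", ranked)
print("changed later:", sorted(changed))
print("precision@2:", precision_at_k(ranked, changed, 2))
```

Swapping in other features, such as lineage frequency or a language-model score, only changes how `scores` is computed; the split and the metric stay the same, which keeps comparisons between prediction rules fair.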
Common Pitfalls
- Using mixed sequence sources with different naming rules, which makes lineage labels and dates hard to compare.
- Scoring raw amino acid changes without aligning residues first, which shifts the position numbers and breaks your analysis.
- Training on future data by accident, which makes the model look better than it really is.
- Treating every mutation as equal, which hides the difference between common background changes and the substitutions that actually drive antigenic drift.
- Comparing prediction results across influenza and SARS-CoV-2 without adjusting for their different mutation rates and protein structures.
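The numbering pitfall is easy to see in a small example. The sketch below assumes you already have a reference and a query sequence as two rows of the same alignment (for instance from MAFFT); the sequences and function names are hypothetical. It reports substitutions in reference numbering and skips insertions and deletions for simplicity.

```python
def reference_positions(aligned_ref, gap="-"):
    """Map each alignment column to a 1-based reference position.

    Columns where the reference has a gap (insertions in other sequences)
    map to None.
    """
    positions = []
    pos = 0
    for ch in aligned_ref:
        if ch == gap:
            positions.append(None)
        else:
            pos += 1
            positions.append(pos)
    return positions

def call_substitutions(aligned_ref, aligned_query, gap="-"):
    """List substitutions such as 'N3Y' using reference numbering."""
    subs = []
    columns = zip(reference_positions(aligned_ref, gap), aligned_ref, aligned_query)
    for ref_pos, r, q in columns:
        if ref_pos is None or q == gap:
            continue  # this simple sketch ignores insertions and deletions
        if r != q:
            subs.append(f"{r}{ref_pos}{q}")
    return subs

# Hypothetical aligned pair: the query carries one inserted residue (G).
ref = "MK-NLA"
query = "MKGYLA"
print(call_substitutions(ref, query))
```

Counting characters in the unaligned query would place the N to Y change at position 4; counting against the aligned reference places it at position 3, which is the number that stays comparable across sequences.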
What Makes This Competitive
A competitive project would do more than rank mutations. It would test whether one prediction signal beats another, such as conservation alone versus conservation plus language-model scoring. It would also use clean time splits, strong baselines, and statistics that show whether the forecast is better than chance. If you add a careful comparison across virus types or lineages, the project gets much stronger.
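One simple way to check "better than chance" is a permutation test: shuffle the ranking many times and see how often a random ordering matches or beats your precision at the top k sites. The sketch below uses only the standard library; the site numbers, the set of later-changed sites, and the function names are hypothetical.

```python
import random

def precision_at_k(ranked, truth, k):
    """Fraction of the top-k ranked sites that appear in the truth set."""
    return sum(1 for site in ranked[:k] if site in truth) / k

def permutation_p_value(ranked, truth, k, n_perm=10_000, seed=0):
    """Estimate how often a random ranking does at least as well as ours."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = precision_at_k(ranked, truth, k)
    sites = list(ranked)  # copy, so the caller's ranking is not shuffled
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(sites)
        if precision_at_k(sites, truth, k) >= observed:
            at_least_as_good += 1
    # The +1 terms keep the estimated p-value from ever being exactly zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)

# Hypothetical example: 50 sites ranked by a model, 5 of which later changed.
ranked_sites = list(range(1, 51))
later_changed = {1, 2, 4, 30, 45}

observed, p_value = permutation_p_value(ranked_sites, later_changed, k=5)
print(f"precision@5 = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

The same function can be run on each baseline, for example conservation alone, so that every prediction rule is judged against the same random-ranking reference.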
Project Variations
- Focus only on SARS-CoV-2 spike receptor-binding domain residues and test whether future Omicron-like changes were predictable from early lineage data.
- Switch to influenza hemagglutinin and compare seasonal drift in H1N1 versus H3N2 using the same scoring pipeline.
- Add structural context by checking whether predicted mutation sites cluster near antibody-binding surfaces or receptor-contact residues.
Learn More
- Nextstrain: Browse free visualizations of viral evolution and clade history on the Nextstrain website.
- NCBI Virus: Search public viral sequence records and metadata through the National Center for Biotechnology Information.
- PubMed: Look for review articles on antigenic drift, spike evolution, and protein language models.
- NIH Office of Data Science Strategy: Find resources on reproducible analysis and biomedical data practices.
- MIT OpenCourseWare, Computational Biology: Study free course materials on sequence analysis, algorithms, and evolutionary models.
- Nature and Science review articles: Search recent reviews on viral evolution, protein language models, and variant prediction.
