Deep Learning Nanobody Ranking for Spike Variants

ISEF Category: Biomedical Engineering

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Synthetic Biology  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A tiny antibody can stop a virus before it gets in. That makes nanobodies a huge deal for fast-moving viruses like SARS-CoV-2 and RSV. The hard part is not just making candidates, but ranking which ones are most likely to bind well before anyone enters the lab. Your project can build that ranking system with free protein AI tools.

What Is It?

This project builds a computer pipeline for finding and ranking nanobody candidates. Nanobodies are very small antibody fragments that can bind to a viral protein, somewhat like a key fitting a lock. You start with protein sequences, use a protein language model such as ESM2 to score how much each sequence resembles natural proteins, use ProteinMPNN to suggest or refine sequences, and then use AlphaFold-Multimer to predict how the nanobody and the viral spike protein might fit together.

Think of it like a custom filter stack. First, you ask, “Does this sequence look like a real folded protein?” Then you ask, “Can I design a version that should fold better?” Finally, you ask, “Does the nanobody seem to sit near the target site in a plausible way?” The benchmark step compares your ranked candidates against known nanobody and coronavirus antibody data in CoV-AbDab, a database of published antibody sequences and annotations.
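One way to picture the filter stack is as a table of per-candidate scores that are normalized and then combined into a single ranking. The sketch below is only an illustration under stated assumptions: the candidate names, the score values, the equal weights, and the helper names `zscores` and `rank_candidates` are all hypothetical. Real scores would come from your own ESM2, ProteinMPNN, and AlphaFold-Multimer runs.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All candidates tie on this feature, so it carries no ranking signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def rank_candidates(scores, weights):
    """Rank candidates by a weighted sum of z-scored features.

    scores:  {candidate_name: {feature_name: value}}, where higher = better.
             Flip the sign of any score where lower is better before calling.
    weights: {feature_name: weight}
    """
    names = list(scores)
    combined = {name: 0.0 for name in names}
    for feature, weight in weights.items():
        z = zscores([scores[name][feature] for name in names])
        for name, value in zip(names, z):
            combined[name] += weight * value
    return sorted(names, key=lambda n: combined[n], reverse=True)

# Made-up values for three hypothetical candidates (illustration only).
example_scores = {
    "nb_A": {"lm_score": -1.2, "design_score": -0.9, "interface_conf": 0.81},
    "nb_B": {"lm_score": -2.5, "design_score": -1.4, "interface_conf": 0.35},
    "nb_C": {"lm_score": -1.0, "design_score": -1.1, "interface_conf": 0.62},
}
example_weights = {"lm_score": 1.0, "design_score": 1.0, "interface_conf": 1.0}

ranking = rank_candidates(example_scores, example_weights)
print(ranking)
```

Z-scoring each feature first keeps one score with a large numeric range from dominating the sum. The weights themselves are a design choice you would need to justify, for example by tuning them on a held-out benchmark split.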

Why This Is a Good Topic

This is a strong science fair topic because you can test a real ranking problem with public data and free tools. You are not just running a model. You are comparing design choices, scoring methods, and benchmark results. That gives you measurable outputs, like rank correlation, enrichment of known binders, and agreement between different prediction steps. The topic also connects to a real medical need, fast antibody design for changing viruses, while still letting you do the work from a computer.

Research Questions

  • How does adding ProteinMPNN-designed variants change the rank order of nanobody candidates compared with using raw sequence scores alone?
  • What is the effect of using AlphaFold-Multimer interface confidence metrics, such as ipTM, on the enrichment of known binders from CoV-AbDab?
  • Does a ranking built on ESM2 embeddings separate known binders from decoys better than a ranking built on raw sequence likelihood?
  • To what extent do predictions for SARS-CoV-2 spike variants transfer to RSV surface targets, such as the fusion (F) protein, with the same pipeline?
  • Which scoring combination best predicts structural plausibility for nanobody-target complexes?
  • How does the choice of spike variant, or of the spike region used as the target (for example, the receptor-binding domain), change the model’s ranking stability across candidate nanobodies?
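Ranking stability, as raised in the last question, can be quantified with Spearman rank correlation between the scores the same candidates receive against two different targets. The sketch below is a plain-Python version written for illustration. The function names and the example scores are hypothetical, and a statistics library would normally be used instead.

```python
def average_ranks(values):
    """Return ranks (1 = smallest value), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for the same four candidates against two spike variants.
variant_1_scores = [0.9, 0.7, 0.5, 0.3]
variant_2_scores = [0.8, 0.9, 0.4, 0.2]

rho = spearman(variant_1_scores, variant_2_scores)
print(round(rho, 3))  # 0.8
```

A value near 1 means the candidate order barely changes between variants, while a value near 0 means the ranking for one variant tells you little about the other.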

Basic Materials

  • A computer with a modern GPU or access to a university or school workstation.
  • A stable internet connection for downloading public protein datasets.
  • Python installed with a scientific environment such as Conda or Anaconda.
  • A spreadsheet program for tracking sequences, scores, and benchmark labels.
  • Access to CoV-AbDab or exported antibody sequence tables.
  • Enough local storage for protein model outputs and intermediate files.

Advanced Materials

  • Access to a Linux workstation or university cluster.
  • NVIDIA GPU with enough memory for protein model inference.
  • Python environment with PyTorch, Biopython, pandas, NumPy, and plotting libraries.
  • Local installation or access to ESM2 model weights.
  • ProteinMPNN and AlphaFold-Multimer ready for batch runs.
  • Curated positive and negative sequence sets from CoV-AbDab and related databases.
  • Tools for structure inspection such as PyMOL or ChimeraX.

Software & Tools

  • Python: Runs data cleaning, scoring, plotting, and model orchestration for the pipeline.
  • ESM2: Scores protein sequences and produces embeddings that help compare nanobody candidates.
  • ProteinMPNN: Suggests sequences that are more likely to fold well around a chosen protein backbone.
  • AlphaFold-Multimer: Predicts protein complex structures and interface confidence for nanobody-target pairs.
  • PubMed: Helps you find papers on nanobody design, spike binding, and benchmark methods.

Experiment Steps

  1. Define the target comparison you want to make, such as raw sequence ranking versus a full design-and-folding pipeline.
  2. Assemble a clean training and benchmark set from CoV-AbDab and related public antibody data.
  3. Choose the scoring features you will compare, including sequence likelihood, design score, and complex confidence.
  4. Plan a fair negative set so your model is tested against realistic non-binders, not easy decoys.
  5. Decide how you will measure success, such as enrichment, precision at top ranks, or ROC-AUC.
  6. Build a validation plan that checks whether the top-ranked candidates agree across multiple prediction methods.
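The success measures named in step 5 can each be computed in a few lines. The sketch below is a minimal plain-Python version for illustration. The function names, scores, and labels are hypothetical, with label 1 marking a known binder and label 0 a non-binder.

```python
def precision_at_k(labels_ranked, k):
    """Fraction of the top-k ranked candidates that are known binders."""
    top = labels_ranked[:k]
    return sum(top) / len(top)

def enrichment_factor(labels_ranked, k):
    """Precision in the top k divided by the binder rate in the whole set."""
    base_rate = sum(labels_ranked) / len(labels_ranked)
    return precision_at_k(labels_ranked, k) / base_rate

def roc_auc(scores, labels):
    """Chance that a random binder outscores a random non-binder (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical pipeline scores and benchmark labels for six candidates.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]

# Sort labels from highest score to lowest before computing top-k metrics.
labels_ranked = [y for _, y in sorted(zip(scores, labels),
                                      key=lambda pair: pair[0], reverse=True)]

print(precision_at_k(labels_ranked, 2))     # 1.0
print(enrichment_factor(labels_ranked, 2))  # 2.0
print(round(roc_auc(scores, labels), 3))    # 0.889
```

An enrichment factor of 2.0 here means the top two slots contain binders at twice the rate expected from picking candidates at random. Top-k metrics mirror how a lab would use the ranking, since only a few candidates get tested, while ROC-AUC summarizes the whole list.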

Common Pitfalls

  • Using sequences from the same family in both train and test sets, which inflates your benchmark scores.
  • Comparing known binders only against random decoys, which makes the task easier than real discovery.
  • Ignoring duplicate or nearly duplicate antibody entries in CoV-AbDab, which biases ranking results.
  • Treating a high AlphaFold-Multimer confidence score as proof of binding. It is a measure of how confident the model is in its predicted structure, not a measured binding affinity.
  • Mixing targets and numbering schemes without careful cleanup, which makes alignment and site comparison unreliable.
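The first and third pitfalls both come down to grouping similar sequences before splitting the data. The sketch below shows the idea with a greedy clustering pass and a split that keeps whole clusters together. It is an illustration under stated assumptions: `difflib.SequenceMatcher` is a crude stand-in for alignment-based sequence identity, the 0.9 threshold is arbitrary, the toy sequences are made up, and the function names are hypothetical. A real project would more likely use a dedicated clustering tool such as CD-HIT or MMseqs2.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def greedy_clusters(sequences, threshold=0.9):
    """Give each sequence the ID of the first cluster representative it resembles."""
    representatives = []
    cluster_ids = []
    for seq in sequences:
        for cid, rep in enumerate(representatives):
            if similarity(seq, rep) >= threshold:
                cluster_ids.append(cid)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(seq)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids

def split_by_cluster(cluster_ids, test_clusters):
    """Return (train_indices, test_indices) with no cluster on both sides."""
    train = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train, test

# Toy sequence fragments for illustration only (not curated benchmark data).
toy_sequences = [
    "QVQLVESGGGLVQAGGSLRLSCAAS",
    "QVQLVESGGGLVQAGGSLRLSCAAS",  # exact duplicate of the first entry
    "QVQLVESGGGLVQPGGSLRLSCAAS",  # one substitution from the first entry
    "DIQMTQSPSSLSASVGDRVTITCRA",  # unrelated fragment
]

cluster_ids = greedy_clusters(toy_sequences)
train_idx, test_idx = split_by_cluster(cluster_ids, test_clusters={1})
print(cluster_ids)          # [0, 0, 0, 1]
print(train_idx, test_idx)  # [0, 1, 2] [3]
```

Because the split is made on cluster IDs rather than on individual sequences, a near-duplicate of a training sequence cannot end up in the test set and inflate the benchmark.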

What Makes This Competitive

A class-level version of this project would just run one model and report a score. A stronger version compares several ranking strategies, uses strict train-test separation, and checks whether the pipeline really enriches known binders. You can make it more advanced by testing transfer across different viral targets, analyzing failure cases, and using statistics that measure ranking quality across the whole list instead of only showcasing top-scoring examples. That shows real modeling judgment, not just the ability to run software.

Project Variations

  • Rank nanobody candidates for one SARS-CoV-2 spike variant, then test whether the ordering changes when the spike sequence changes.
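  • Swap the target from SARS-CoV-2 to RSV and compare how well the same scoring pipeline transfers. CoV-AbDab covers only coronaviruses, so RSV benchmark labels would have to come from another source.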
  • Compare structure-based ranking against sequence-only ranking to see which method better enriches known binders.

Learn More

  • CoV-AbDab: Search the database of coronavirus antibodies for known nanobody and antibody sequences, then use the records as benchmark data.
  • ESM2 paper in Science: Read about protein language models and find the paper through PubMed or the journal site.
  • ProteinMPNN paper in Science: Learn how sequence design from structure works, then search PubMed for the article.
  • AlphaFold-Multimer preprint on bioRxiv: Find the complex prediction method and its benchmark discussion by searching bioRxiv for the preprint.
  • NIH PubMed: Search review articles on nanobody engineering, spike binding, and antibody design benchmarks.
  • MIT OpenCourseWare: Look for free courses on machine learning, molecular biology, or computational biology to fill in background gaps.
