HIV Resistance Mutation Prediction

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Pharmacology · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

HIV mutates fast enough to slip past some drugs, and a single mutation can change how well a treatment works. You can model that drug pressure with data instead of a wet lab. Your project becomes a question about prediction, not guesswork.

What Is It?

This project asks you to predict which HIV mutations are likely to appear when the virus faces drug pressure. HIV protease and reverse transcriptase are enzymes the virus needs to replicate: reverse transcriptase copies the viral RNA genome into DNA, and protease cuts viral polyproteins into working proteins. Many antiretroviral drugs target those enzymes. When mutations change those proteins, some drugs work less well.

A transformer model is a type of machine learning model that reads sequence data and learns patterns in order and context. Think of it like a smart pattern finder that can compare a mutation with the rest of the protein and the drug history around it. Instead of asking, “Does this mutation exist?”, you ask, “Given this treatment path, which resistance mutations are most likely next?”

Stanford HIVdb gives you a way to compare your model with a known resistance interpretation system. That makes the project more than a prediction demo. You can test whether your model learns useful biology from real sequence and drug-pressure data.
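
A resistance mutation is usually written as the reference amino acid, the position, and the new amino acid, for example M46I. The sketch below shows one way such substitutions could be listed by comparing a query sequence with a reference. It assumes the two sequences are already aligned to the same length. The function name call_mutations and both toy sequences are invented for illustration and are not real HIV sequences.

```python
def call_mutations(reference: str, query: str) -> list[str]:
    """List substitutions such as 'A4V' between two pre-aligned sequences."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    # Protein positions are counted from 1 by convention.
    for position, (ref_aa, query_aa) in enumerate(zip(reference, query), start=1):
        # Skip matches and alignment gaps; only report real substitutions.
        if ref_aa != query_aa and query_aa != "-":
            mutations.append(f"{ref_aa}{position}{query_aa}")
    return mutations


# Toy 10-residue sequences, made up for this example.
reference = "MKTAYIAKQR"
query = "MKTVYIAKER"
mutations = call_mutations(reference, query)
print(mutations)
```

A list like this can serve as the label set your model tries to predict, or as input features for a baseline.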

Why This Is a Good Topic

This is a strong science fair topic because it has clear input data, a clear output, and a real medical use case. You can test whether your model predicts resistance mutations better than simple baseline methods. You also get to work with messy biological data, which feels much closer to real research than a toy dataset. A student can learn sequence modeling, evaluation metrics, data cleaning, and how to think about clinical relevance.

Research Questions

How does adding drug-treatment history change the accuracy of a transformer model for predicting HIV resistance mutations?
What is the effect of using protease sequences versus reverse transcriptase sequences on mutation prediction performance?
Does a transformer outperform a simpler baseline model, such as logistic regression or random forest, on Stanford HIVdb benchmark data?
To what extent does training on simulated drug-pressure trajectories improve prediction of future resistance mutations?
Which mutation classes are easier for the model to predict: major resistance mutations or accessory mutations?
How does class imbalance affect the model's ability to detect rare resistance mutations?
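
The class-imbalance question can be made concrete with a small sketch. It uses invented labels, 95 samples without the mutation and 5 with it, and a "model" that always predicts the majority class. The label names, the counts, and the helper per_class_recall are assumptions for illustration only. The class weights follow the common inverse-frequency formula n_samples / (n_classes * class_count).

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the predictions recover."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Invented data: a rare mutation appears in 5 of 100 samples.
y_true = ["none"] * 95 + ["rare_mutation"] * 5
y_pred = ["none"] * 100  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

# Inverse-frequency weights give rare classes more influence during training.
counts = Counter(y_true)
class_weights = {
    label: len(y_true) / (len(counts) * count) for label, count in counts.items()
}

print(f"accuracy: {accuracy:.2f}")
print(f"recall: {recall}")
print(f"class weights: {class_weights}")
```

Here the accuracy looks strong while the rare class is never detected, which is why per-class metrics are worth reporting alongside any overall score.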

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Python installed with Jupyter Notebook or VS Code.
Public HIV sequence dataset from Stanford HIVdb or linked benchmark sources.
Free data analysis libraries such as pandas, NumPy, scikit-learn, and PyTorch or TensorFlow.
Spreadsheet software for tracking experiments and results.
Digital notebook or lab notebook for model logs and hyperparameters.

Advanced Materials

Access to a machine with a dedicated GPU or university computing cluster.
Python environment with PyTorch, Hugging Face Transformers, and scikit-learn.
Curated HIV protease and reverse transcriptase sequence dataset with treatment labels.
Stanford HIVdb interpretation outputs for benchmarking.
Git for version control and reproducibility tracking.
Statistical analysis tools such as R or Python SciPy for significance testing.

Software & Tools

Python: Builds the data pipeline, trains models, and runs evaluation scripts.
Jupyter Notebook: Lets you explore the dataset, test features, and document results in one place.
PyTorch: Trains transformer models and baseline neural networks.
scikit-learn: Runs baseline models, splits data, and calculates performance metrics.
ImageJ: Not needed for this project. Skip it unless you use it in a separate figure workflow.

Experiment Steps

Define the prediction target, such as the next resistance mutation or a mutation class, so your model has one clear job.
Curate a dataset that links viral sequence, drug exposure history, and resistance labels, then decide how you will split train, validation, and test sets.
Choose a baseline method first, so you can prove the transformer adds value beyond a simple model.
Design your sequence representation, including how you will encode amino acids, treatment steps, and mutation context.
Plan your benchmark metric, such as AUROC, F1 score, or top-k accuracy, and decide how you will handle rare mutation classes.
Set up a subgroup check so you can test whether performance drops for certain drugs, mutation groups, or sequence lengths.
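
The sequence-representation step can be sketched as follows. This is one simple option, not the only one: one-hot encoding for amino acids and a multi-hot vector for drug exposure. The function names, the four-drug vocabulary, and the toy sequence are assumptions made for this example. The multi-hot history deliberately ignores the order of treatment, so treat it as a baseline encoding rather than a full representation of drug pressure.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_sequence(sequence: str) -> list[list[int]]:
    """One row per residue; unknown symbols such as 'X' become all-zero rows."""
    rows = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


def encode_drug_history(history: list[str], vocabulary: list[str]) -> list[int]:
    """Multi-hot vector: 1 if a drug appears anywhere in the history.

    This drops the order of treatment steps on purpose.
    """
    seen = set(history)
    return [1 if drug in seen else 0 for drug in vocabulary]


# Example drug vocabulary and patient history, chosen only for illustration.
DRUGS = ["AZT", "3TC", "EFV", "LPV"]
encoded_sequence = one_hot_sequence("MKTX")
encoded_history = encode_drug_history(["AZT", "3TC"], DRUGS)

print(len(encoded_sequence), "residues,", len(encoded_sequence[0]), "channels each")
print(encoded_history)
```

A transformer would typically replace the one-hot rows with learned embeddings and keep the treatment steps as an ordered token sequence, so comparing both encodings is a natural experiment.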

Common Pitfalls

Mixing sequence records from the same patient across train and test sets, which makes performance look better than it really is.
Treating every mutation as equally common, which hides severe class imbalance in HIV resistance data.
Using raw model scores without benchmarking against Stanford HIVdb, which leaves you with no biological reference point.
Encoding drug history too loosely, which can erase the pressure signal the model is supposed to learn.
Reporting one overall accuracy number, which can hide weak performance on rare but clinically useful mutations.
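
The patient-leakage pitfall above can be avoided by splitting on patients rather than on individual sequences. The sketch below is a minimal standard-library version. The record fields patient_id and sequence_id and the helper split_by_patient are hypothetical names, and the records are generated, not real data.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)

    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)  # seeded for reproducibility

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [r for pid in patient_ids if pid not in test_ids for r in by_patient[pid]]
    test = [r for pid in patient_ids if pid in test_ids for r in by_patient[pid]]
    return train, test


# Generated records: 10 hypothetical patients with 3 sequences each.
records = [{"patient_id": f"P{i // 3}", "sequence_id": f"S{i}"} for i in range(30)]
train, test = split_by_patient(records)

train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
print(len(train), "train records,", len(test), "test records")
print("shared patients:", train_patients & test_patients)
```

If you use scikit-learn, GroupShuffleSplit and GroupKFold in sklearn.model_selection offer the same kind of group-aware splitting.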

What Makes This Competitive

A stronger project would not stop at a basic prediction score. You would compare multiple model designs, test several split strategies, and show whether the model generalizes to unseen drugs or mutation combinations. You could also analyze which sequence positions matter most and whether the model flags known resistance pathways. That kind of analysis turns a coding project into a real research study.

Project Variations

Predict resistance mutations only for protease, then compare the result with reverse transcriptase.
Use amino-acid embeddings instead of raw one-hot encoding, then test whether the model learns better sequence context.
Predict the next mutation step under different simulated drug sequences, then compare short trajectories with longer ones.

Learn More

Stanford HIV Drug Resistance Database: Search for HIVdb, sequence interpretation, and resistance mutation resources from Stanford University.
NIH HIV Resistance Resources: Search NIH and NIAID pages for drug resistance overviews and clinical background.
PubMed: Search for review articles on HIV protease resistance, reverse transcriptase resistance, and transformer models in genomics.
NCBI Virus: Find viral sequence data and related annotation resources through NCBI's public databases.
Hugging Face Course: Read the free material on transformers and sequence modeling, then adapt the ideas to biological sequences.
MIT OpenCourseWare: Search for free machine learning lectures that cover sequence models, evaluation, and overfitting.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
