Predicting Enhancer Variant Effects

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genetics · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months

The Hook

A single DNA change can act like a dimmer switch, turning an enhancer up, down, or off. That matters because enhancers help control when genes turn on. With public MPRA data, you can train a model to predict those effects instead of guessing by eye from the sequence. This turns a genetics question into a real machine-learning project.

What Is It?

Enhancers are short stretches of DNA that help control gene activity. You can think of them like volume knobs for genes. A tiny variant inside an enhancer can change how strongly that knob works, which can shift how much of that gene's product gets made.

MPRA stands for massively parallel reporter assay. In this type of experiment, researchers test many DNA sequences at once and measure how strongly each one drives a reporter gene, which shows whether a variant raises or lowers activity. Your project uses those public results as training data, then asks whether machine learning can predict a variant's effect on enhancer activity from sequence features, like motif changes, GC content, or conservation.
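As a rough illustration of what "sequence features" can look like in code, the sketch below computes GC content and the change in k-mer counts between a reference sequence and its variant. The function names (gc_content, kmer_counts, variant_features) and the toy sequences are made up for this example. A real MPRA table will have its own layout, and motif or conservation features would need extra annotation files.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def variant_features(ref_seq: str, alt_seq: str, k: int = 3) -> dict:
    """Features describing how the variant changes the sequence."""
    ref_k = kmer_counts(ref_seq, k)
    alt_k = kmer_counts(alt_seq, k)
    features = {
        "gc_ref": gc_content(ref_seq),
        "gc_delta": gc_content(alt_seq) - gc_content(ref_seq),
    }
    # Keep only k-mers whose count changed; the rest say nothing about the variant.
    for kmer in sorted(set(ref_k) | set(alt_k)):
        delta = alt_k[kmer] - ref_k[kmer]
        if delta != 0:
            features[f"kmer_delta_{kmer}"] = delta
    return features


# Toy example: a single A>G change at offset 4 (0-based).
ref = "ACGTACGTAC"
alt = "ACGTGCGTAC"
feats = variant_features(ref, alt)
print(feats)
```

Using the difference between the variant and reference counts, rather than the raw counts, keeps the features focused on what the variant actually changed.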

The cool part is that you are not just labeling variants as good or bad. You are trying to predict how big the effect is, which makes the project more realistic and more useful.

Why This Is a Good Topic

This is a strong science fair topic because you can test a clear prediction, measure performance with real numbers, and compare different modeling choices. It connects to gene regulation, which matters in disease, development, and animal biology. You can work with public data, so you do not need a wet lab, but you still get to do original analysis and build a model that answers a real biology question.

Research Questions

How does adding motif-change features affect the accuracy of a model that predicts enhancer activity from variant sequence?
What is the effect of using only sequence-based features versus sequence plus conservation features on prediction performance?
Does a random forest model outperform linear regression for predicting MPRA-measured enhancer activity?
To what extent does model performance change when you train on one MPRA dataset and test on another?
Which feature groups, such as GC content, motif disruption, or k-mer counts, contribute most to enhancer activity prediction?
How does variant location within the enhancer, near the center versus near the edge, affect predicted impact?
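The last question needs a number that says how close a variant sits to the middle of the tested sequence. One possible encoding, sketched below, scales the distance from the center so that 0 means dead center and 1 means either end. It assumes positions are 0-based offsets within the tested sequence. The function names and the 0.5 cutoff are illustrative choices, not something taken from a particular dataset.

```python
def relative_distance_from_center(pos: int, seq_len: int) -> float:
    """Return 0.0 for a variant at the exact center, 1.0 at either end.

    Assumes `pos` is a 0-based offset into a sequence of length `seq_len`.
    """
    if seq_len < 2:
        raise ValueError("sequence must contain at least two bases")
    if not 0 <= pos < seq_len:
        raise ValueError("position is outside the sequence")
    center = (seq_len - 1) / 2
    return abs(pos - center) / center


def position_bin(pos: int, seq_len: int, cutoff: float = 0.5) -> str:
    """Label a variant 'center' or 'edge' using an arbitrary cutoff."""
    distance = relative_distance_from_center(pos, seq_len)
    return "center" if distance <= cutoff else "edge"


# A 201-base sequence has its center at offset 100.
print(relative_distance_from_center(100, 201))  # 0.0
print(relative_distance_from_center(0, 201))    # 1.0
print(position_bin(60, 201))                    # center (distance 0.4)
print(position_bin(10, 201))                    # edge (distance 0.9)
```

You could feed the continuous distance to a model as a feature, or use the bins to compare prediction error for center variants against edge variants.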

Basic Materials

Laptop or desktop computer with at least 8 GB RAM.
Internet access for downloading public MPRA datasets.
Spreadsheet software for tracking samples and metadata.
Python installed with pandas, scikit-learn, NumPy, and matplotlib.
Optional Jupyter Notebook for cleaner analysis and plots.
Public MPRA dataset files from a peer-reviewed paper or database.
Reference genome and annotation files from a public source such as UCSC or Ensembl.
Basic text editor for cleaning sample lists and notes.

Advanced Materials

Workstation or lab computer with more RAM for larger feature matrices.
Python with scikit-learn, XGBoost, SHAP, pandas, NumPy, SciPy, and matplotlib.
GPU access if you want to test a neural network model.
FASTA files for reference sequences and matched control regions.
Genome annotation tracks for enhancers, transcription factor motifs, and conservation scores.
Linux shell tools for data prep and reproducible pipelines.
Version control repository for code, data dictionaries, and results.
Access to additional public datasets for external validation.

Software & Tools

Python: Runs data cleaning, feature engineering, model training, and evaluation.
Jupyter Notebook: Keeps code, notes, and plots in one place.
pandas: Organizes MPRA tables, labels, and feature matrices.
scikit-learn: Builds baseline models and cross-validation tests.
matplotlib: Makes plots for prediction error, feature importance, and model comparison.

Experiment Steps

Define the exact prediction target, such as binary enhancer effect or continuous activity change.
Choose one public MPRA dataset and decide how you will split training, validation, and test data.
Build a feature set from sequence and annotation signals, then check that every variant has matching labels.
Train a simple baseline model first, so you have something honest to beat.
Compare a few model types and evaluate them with the same metrics and the same held-out test set.
Inspect the strongest errors, then refine the feature set or split strategy to see what the model is missing.
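The steps above can be sketched in a few lines of scikit-learn. This is a minimal outline, not a finished pipeline: the data are random numbers standing in for a real feature table, the target formula is invented so the models have something to learn, and the group labels play the role of enhancer IDs. The point is the shape of the workflow: split by enhancer, fit a baseline, then score every model on the same held-out set.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature table: 40 enhancers x 10 variants each.
n_enhancers, variants_per_enhancer, n_features = 40, 10, 5
n = n_enhancers * variants_per_enhancer
X = rng.normal(size=(n, n_features))
groups = np.repeat(np.arange(n_enhancers), variants_per_enhancer)
# Made-up target: one linear term, one nonlinear term, plus noise.
y = 1.5 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.3, size=n)

# Hold out whole enhancers so no enhancer appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    results[name] = {
        "r2": r2_score(y[test_idx], pred),
        "mae": mean_absolute_error(y[test_idx], pred),
    }
    print(f"{name:>18}: R2={results[name]['r2']:.3f}  MAE={results[name]['mae']:.3f}")
```

The mean baseline always predicts the average training value, so its R2 on the test set sits near zero. Any model worth reporting should clearly beat it on the same held-out variants.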

Common Pitfalls

Mixing variants from the same enhancer across train and test sets, which leaks information and inflates test scores.
Using raw sequences with inconsistent orientation, which can scramble feature extraction.
Treating all MPRA datasets as interchangeable, which hides platform-specific signal differences.
Judging the model only by training score, which makes overfitting look like success.
Ignoring class imbalance or noisy labels, which can make a weak model seem better than it is.
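Two of these pitfalls are easy to guard against with small helper functions. The sketch below assumes your table has a strand label ("+" or "-") for each sequence and an enhancer ID for each variant; both are assumptions about your data, and the function names are hypothetical. The first helper puts every sequence on the same strand before feature extraction, and the second reports enhancer IDs that appear in both the training and test sets.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def to_plus_strand(seq: str, strand: str) -> str:
    """Return the sequence as read on the plus strand."""
    if strand == "+":
        return seq.upper()
    if strand == "-":
        return reverse_complement(seq)
    raise ValueError(f"unknown strand label: {strand!r}")


def leaked_enhancers(train_ids, test_ids) -> set:
    """Enhancer IDs that appear in both splits (should be empty)."""
    return set(train_ids) & set(test_ids)


print(reverse_complement("AACGT"))             # ACGTT
print(to_plus_strand("aacgt", "-"))            # ACGTT
print(leaked_enhancers(["enh1", "enh2", "enh3"], ["enh3", "enh4"]))  # {'enh3'}
```

Running the leakage check right after you split the data, and stopping if it returns anything, is a cheap way to catch the first pitfall before it reaches your results.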

What Makes This Competitive

A stronger project goes beyond a single model fit. You can compare multiple feature sets, test whether the model generalizes across datasets, and explain which sequence signals matter most. The best versions also stress-test the model with careful splits, external validation, and error analysis on hard cases. That kind of design shows real research thinking, not just software use.
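One common way to ask which signals matter most is permutation importance: shuffle one feature at a time on held-out data and see how much the score drops. The sketch below uses scikit-learn's permutation_importance on random data. The feature names are hypothetical and the target is invented so that one feature matters a lot and another matters a little; it only shows the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)

# Hypothetical feature names; the values below are random, not real data.
feature_names = ["gc_delta", "motif_score_delta", "conservation", "noise_1", "noise_2"]
n = 400
X = rng.normal(size=(n, len(feature_names)))
# Made-up target that depends strongly on one feature and weakly on another.
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Simple split for independent synthetic rows; real variants should be
# split by enhancer so related sequences stay together.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranked = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
for name, drop in ranked:
    print(f"{name:>18}: mean drop in R2 = {drop:.3f}")
```

Measuring importance on held-out data, rather than on the training set, keeps the ranking tied to what the model actually uses to generalize. Correlated features can share or hide importance, so treat the ranking as a clue to investigate, not a final answer.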

Project Variations

Use human disease-linked enhancer variants instead of general regulatory variants to ask whether clinically relevant changes are easier to predict.
Compare MPRA data from different cell types to see whether enhancer rules stay the same across biological contexts.
Predict enhancer activity with k-mer features only, then compare that against motif-based features to see which representation works better.

Learn More

PubMed: Search for review articles on enhancer function, MPRA methods, and regulatory variant prediction.
NIH National Library of Medicine: Search PubMed and related genetics resources for primary studies on gene regulation.
ENCODE Project: Find public annotations for regulatory elements, transcription factor binding, and chromatin state.
UCSC Genome Browser: Inspect genomic regions, conservation tracks, and known enhancer annotations.
Nature Methods: Search for MPRA papers and method comparisons in the journal archive.
Genome Biology: Search for studies on regulatory variants, enhancer modeling, and functional genomics.

Animal Sciences Category Guide

How to Do Real Animal Sciences Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →