Predicting Noncoding Variant Pathogenicity With ML

ISEF Category: Cellular and Molecular Biology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genetics · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

A tiny DNA change outside a gene can still matter a lot. Think of it like changing a dimmer switch, not the light bulb itself. Those regulatory variants can change when, where, and how strongly a gene turns on. Your project can test whether machine learning can spot the risky ones before they cause harm.

What Is It?

This topic is about non-coding regulatory variants, which are DNA changes that do not alter a protein sequence but can still affect gene expression. Gene expression means how much RNA or protein a gene makes. A useful analogy is a thermostat in a house. The thermostat does not make heat, but it controls when the furnace runs. Regulatory variants can change those control signals.

Your goal is to train a machine learning model to predict which variants are more likely to be pathogenic, meaning disease-causing. ENCODE gives you clues about regulatory activity, such as chromatin marks and transcription factor binding. GTEx gives you eQTL (expression quantitative trait locus) data, which identify variants statistically associated with changes in gene expression in specific human tissues. ClinVar gives you labeled variants, including ones that were reclassified over time, so you can test whether your model's predictions agree better with newer classifications than with older ones.
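Before any modeling, the ClinVar classifications need to become labels a classifier can use. The sketch below is one possible way to do that, not a required method. The grouping in `LABEL_MAP`, the helper name `add_binary_label`, and the toy variant IDs are all assumptions made for illustration. Whether "likely" calls count as positives or negatives is a design decision you should state in your project.

```python
# Assumed grouping of ClinVar-style significance terms into two classes.
# Terms that are not listed (for example "Uncertain significance") are dropped.
LABEL_MAP = {
    "Pathogenic": 1,
    "Likely pathogenic": 1,
    "Pathogenic/Likely pathogenic": 1,
    "Benign": 0,
    "Likely benign": 0,
    "Benign/Likely benign": 0,
}


def add_binary_label(rows, column="clinical_significance"):
    """Return new dicts with a 0/1 'label' key; skip rows with unmapped terms."""
    labeled = []
    for row in rows:
        label = LABEL_MAP.get(row[column])
        if label is not None:
            labeled.append({**row, "label": label})
    return labeled


# Toy rows with made-up identifiers, only to show the behavior.
toy_variants = [
    {"variant_id": "v1", "clinical_significance": "Pathogenic"},
    {"variant_id": "v2", "clinical_significance": "Likely benign"},
    {"variant_id": "v3", "clinical_significance": "Uncertain significance"},
    {"variant_id": "v4", "clinical_significance": "Benign"},
]

labeled_variants = add_binary_label(toy_variants)
for row in labeled_variants:
    print(row["variant_id"], row["label"])
```

Keeping the mapping in one dictionary makes it easy to rerun the whole analysis under a stricter definition, such as counting only "Pathogenic" and "Benign".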

Why This Is a Good Topic

This is a strong science fair topic because you can build a real prediction pipeline from public data, compare models, and measure performance with clear statistics. The question is testable, and the data are already available, so you do not need a wet lab to start. The project connects to real clinical genetics, since many disease-linked variants are non-coding and hard to interpret. You can also learn data cleaning, feature engineering, model evaluation, and how to think about changing scientific labels over time.

Research Questions

How does adding ENCODE regulatory features change model accuracy for non-coding variant pathogenicity prediction?
What is the effect of using GTEx eQTL evidence on classification performance for tissue-specific regulatory variants?
For variants reclassified between 2018 and 2025, do predictions from a model trained on 2018 ClinVar labels agree better with the 2025 classifications than with the original 2018 labels?
To what extent does model performance differ when you evaluate it separately by tissue type, such as brain, liver, or blood?
Which feature group, ENCODE-only, GTEx-only, or combined data, gives the best separation between pathogenic and benign variants?
How does class imbalance affect precision, recall, and false positive rate in non-coding variant prediction?
How do stricter rules for matching variants to regulatory regions, such as requiring direct overlap instead of allowing a nearby window, affect the stability of model results?
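The feature-group question above can be sketched as a small comparison loop. This is a minimal illustration on synthetic data, assuming logistic regression as the classifier and five-fold stratified cross-validation. The arrays `encode_feats` and `gtex_feats` are random stand-ins, not real ENCODE or GTEx values, so the printed scores say nothing about real biology.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)  # synthetic labels: 1 = pathogenic, 0 = benign

# Synthetic stand-ins for real features; the column meanings are hypothetical.
encode_feats = rng.normal(size=(n, 3)) + 0.6 * y[:, None]
gtex_feats = rng.normal(size=(n, 2)) + 0.4 * y[:, None]

feature_groups = {
    "ENCODE-only": encode_feats,
    "GTEx-only": gtex_feats,
    "Combined": np.hstack([encode_feats, gtex_feats]),
}

# The same folds are reused for every feature group so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, X in feature_groups.items():
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUROC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean helps you judge whether a difference between feature groups is larger than fold-to-fold noise.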

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Stable internet connection for downloading public genomics datasets.
Spreadsheet software for tracking variants and labels.
Python installed with pandas, numpy, scikit-learn, and matplotlib.
Jupyter Notebook or Google Colab for code and notes.
Public variant files from ClinVar.
ENCODE annotation tracks or summary tables.
GTEx eQTL summary data.
File storage folder with versioned subfolders for raw data, processed data, and results.

Advanced Materials

Access to a university or cloud computing environment with larger memory and storage.
Python environment with xgboost, lightgbm, shap, and scikit-learn.
Genome annotation tools such as bedtools and pybedtools.
Access to UCSC Genome Browser track files for cross-checking variant locations.
ClinVar release archives from multiple years.
ENCODE and GTEx bulk download files for tissue-specific features.
Variant effect score resources for comparison, such as CADD or deep learning prediction tables from public sources.
Git for version control and reproducible analysis.

Software & Tools

Python: Cleans data, builds models, and runs statistics for your prediction pipeline.
Jupyter Notebook: Keeps code, notes, plots, and results in one place.
scikit-learn: Trains baseline classifiers and evaluates model performance.
pandas: Organizes ClinVar, ENCODE, and GTEx tables into analysis-ready formats.
UCSC Genome Browser: Helps you confirm that variants line up with the right regulatory regions.
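Checking that a variant falls inside a regulatory region comes down to a coordinate comparison. The sketch below assumes regions use the BED convention (0-based start, exclusive end) and variants use a 1-based position, as in VCF files. The `Region` type, the function name, and all coordinates are made up for illustration. For real datasets a dedicated tool such as bedtools is the more practical choice, since this version scans every region for every variant.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def overlapping_regions(chrom: str, pos_1based: int, regions: list[Region]) -> list[str]:
    """Names of regions that contain a single-base variant given as a 1-based position."""
    pos0 = pos_1based - 1  # convert the VCF-style position to 0-based
    return [r.name for r in regions if r.chrom == chrom and r.start <= pos0 < r.end]


# Made-up regions and positions.
regions = [
    Region("chr1", 100, 200, "enhancer_A"),
    Region("chr1", 150, 300, "promoter_B"),
    Region("chr2", 100, 200, "enhancer_C"),
]

print(overlapping_regions("chr1", 101, regions))  # first base of enhancer_A
print(overlapping_regions("chr1", 100, regions))  # one base before enhancer_A
print(overlapping_regions("chr1", 200, regions))  # last base of enhancer_A, also in promoter_B
```

Spot-checking a few variants by eye in the UCSC Genome Browser against the output of a function like this is a quick way to catch off-by-one errors.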

Experiment Steps

Define the exact prediction task, including what counts as pathogenic, benign, and reclassified over time.
Choose one variant set and one feature set first, so you can test your pipeline before scaling up.
Decide how you will map each variant to ENCODE and GTEx signals, including tissue matching rules.
Build a baseline model, then compare it against a version with added regulatory features.
Plan a validation scheme that keeps variants from the same gene or region from leaking across train and test sets.
Predefine the metrics you will report, such as AUROC, precision, recall, and calibration.
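The validation and metric steps above can be sketched together. This example assumes scikit-learn's `GroupKFold` as the leakage-aware splitter, with a hypothetical gene ID per variant as the group, and uses the Brier score as one simple calibration-related summary. The data are synthetic, and the 0.5 decision threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)                   # synthetic labels
X = rng.normal(size=(n, 4)) + 0.7 * y[:, None]   # synthetic features
genes = rng.integers(0, 30, size=n)              # hypothetical gene ID for each variant

fold_metrics = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=genes):
    # No gene may appear on both sides of the split.
    assert set(genes[train_idx]).isdisjoint(genes[test_idx])
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]
    pred = (prob >= 0.5).astype(int)
    fold_metrics.append({
        "auroc": roc_auc_score(y[test_idx], prob),
        "precision": precision_score(y[test_idx], pred, zero_division=0),
        "recall": recall_score(y[test_idx], pred, zero_division=0),
        "brier": brier_score_loss(y[test_idx], prob),
    })

for name in ["auroc", "precision", "recall", "brier"]:
    values = [m[name] for m in fold_metrics]
    print(f"{name}: {np.mean(values):.3f}")
```

Grouping by gene is only one option. Grouping by regulatory region or by chromosome follows the same pattern with a different `groups` array.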

Common Pitfalls

Mixing ClinVar labels from different release years without tracking which release each variant's label came from, which makes your benchmark inconsistent.
Letting variants from the same gene or regulatory region appear in both the training and test sets, which inflates accuracy.
Treating every non-coding variant as if it has the same tissue context, which hides tissue-specific effects.
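Joining ENCODE and GTEx annotations with the wrong genome build (for example, GRCh37 versus GRCh38) or coordinate system (for example, 0-based BED intervals versus 1-based VCF positions), which scrambles your features.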
Reporting only overall accuracy, which can look strong even when the model misses rare pathogenic variants.
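The last pitfall is easy to show with arithmetic. The sketch below uses a made-up benchmark of 5 pathogenic variants among 100 and a "model" that calls everything benign. The helper name `summarize` and the class counts are assumptions chosen only to make the point.

```python
def summarize(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = pathogenic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up benchmark: 5 pathogenic variants among 100.
y_true = [1] * 5 + [0] * 95
always_benign = [0] * 100

summary = summarize(y_true, always_benign)
print(summary)  # accuracy is 0.95 even though recall is 0.0
```

A model that never finds a single pathogenic variant still scores 95% accuracy here, which is why precision, recall, and AUROC belong next to accuracy in your results table.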

What Makes This Competitive

A stronger version of this project goes beyond one model and one score. You can test whether newer ClinVar reclassifications match your predictions better than older labels, then break performance down by tissue, variant type, and regulatory feature group. You can also compare simple models with more advanced ones and use careful validation to avoid data leakage. That kind of analysis shows you understand both genetics and machine learning, not just how to run code.
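The reclassification test described above can be sketched as a simple agreement check. The records, scores, and the 0.5 threshold below are invented for illustration. In a real project the scored variants should be held out from training, otherwise agreement with the older labels is inflated.

```python
# Made-up held-out variants: labels from an older and a newer ClinVar release,
# plus a hypothetical model score (predicted probability of pathogenic).
records = [
    {"variant_id": "v1", "old_label": 0, "new_label": 1, "score": 0.81},
    {"variant_id": "v2", "old_label": 1, "new_label": 0, "score": 0.30},
    {"variant_id": "v3", "old_label": 0, "new_label": 1, "score": 0.42},
    {"variant_id": "v4", "old_label": 1, "new_label": 1, "score": 0.90},
    {"variant_id": "v5", "old_label": 0, "new_label": 0, "score": 0.12},
]


def agreement(rows, label_key, threshold=0.5):
    """Fraction of rows where the thresholded score equals the chosen label."""
    hits = sum(int(r["score"] >= threshold) == r[label_key] for r in rows)
    return hits / len(rows)


# Keep only variants whose label changed between the two releases.
reclassified = [r for r in records if r["old_label"] != r["new_label"]]
old_agree = agreement(reclassified, "old_label")
new_agree = agreement(reclassified, "new_label")

print(f"reclassified variants: {len(reclassified)}")
print(f"agreement with older labels: {old_agree:.2f}")
print(f"agreement with newer labels: {new_agree:.2f}")
```

With binary labels that flipped, the two agreement values always sum to 1, so the real question is whether agreement with the newer labels is clearly above 0.5 on a large enough set of reclassified variants.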

Project Variations

Focus only on variants near liver-expressed genes and test whether tissue matching improves prediction.
Compare rule-based scoring against machine learning on the same non-coding variant set.
Add evolutionary conservation features, such as phyloP or phastCons scores, and test whether they improve predictions based on ENCODE and GTEx features.

Learn More

ENCODE Project: Search the ENCODE portal for regulatory annotation data, chromatin marks, and transcription factor binding summaries.
GTEx Portal: Search the GTEx site for tissue-specific eQTL resources and expression data.
ClinVar: Use the NCBI ClinVar database to find variant classifications and review reclassification history.
NIH Genetic and Rare Diseases Information Center: Read plain-language background on how variants can affect disease.
PubMed: Search for review articles on non-coding variants, eQTLs, and pathogenicity prediction models.
UCSC Genome Browser: Check genomic coordinates and visualize regulatory tracks across tissues.

