Predicting Autism Noncoding Variant Effects

ISEF Category: Computational Biology and Bioinformatics

Subcategory: Genomics  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

Most disease-linked DNA changes do not fall inside protein-coding sequence. Many sit in the regulatory switches that turn genes on and off. That makes autism genetics harder to study, but also richer for a student project. You can ask how well modern models predict which non-coding variants matter.

What Is It?

This project asks a simple question with a hard answer: if a DNA change does not alter a protein, can you still predict whether it changes gene control? Non-coding regulatory variants sit in stretches of DNA that act like switches, dimmers, and routing labels. They can affect when a gene turns on, where it turns on, and how strongly it responds. In autism research, these regions are a promising but still unsettled source of clues.

Think of the genome like a huge building with millions of light switches. Protein-coding variants may break the bulb itself. Non-coding variants often damage the switch, the wiring, or the label on the switch panel. ENCODE and Roadmap Epigenomics provide maps of those switches across cell types and tissues. Enformer-style embeddings turn long DNA sequences into numeric features, and a multilayer perceptron (MLP), a simple feed-forward neural network, can then be fine-tuned on those features to learn patterns that separate likely functional variants from less likely ones.
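To make "turning DNA into numeric features" concrete, here is a minimal sketch of the simplest possible encoding: one-hot vectors for a reference window and the same window carrying the variant. It is not Enformer, which needs a much longer window and a pretrained model. The helper names and the 11-base toy sequence are assumptions made up for illustration.

```python
# Hypothetical helpers; the toy sequence stands in for a real reference-genome window.
BASES = "ACGT"

def one_hot(seq):
    """Return one 4-element row per base; unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def ref_alt_windows(ref_window, center_index, alt_base):
    """Return (ref, alt) sequences that differ only at center_index."""
    alt_window = ref_window[:center_index] + alt_base + ref_window[center_index + 1:]
    return ref_window, alt_window

def flatten(rows):
    """Flatten a list of rows into one flat feature vector."""
    return [value for row in rows for value in row]

# Toy example: a G-to-A change at the center (index 5) of an 11-base window.
ref_seq, alt_seq = ref_alt_windows("ACGTTGCAACG", center_index=5, alt_base="A")
ref_features = flatten(one_hot(ref_seq))
alt_features = flatten(one_hot(alt_seq))

# One simple per-variant feature: element-wise difference between alt and ref encodings.
delta = [a - r for a, r in zip(alt_features, ref_features)]
```

A sequence-only baseline can use vectors like these directly. An embedding-based model would replace `one_hot` with the output of a pretrained network while keeping the same ref-versus-alt comparison.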

Why This Is a Good Topic

This is a strong science fair topic because the question is testable with public data, and the analysis has real biomedical value. You can compare different feature sets, model types, and cell-type tracks to see what improves prediction. You can also learn data cleaning, model evaluation, and error analysis, which are core research skills. The project scales well, so you can make it simple or push it toward a deeper machine learning study.

Research Questions

  • How does adding ENCODE chromatin accessibility tracks change prediction of non-coding variant effects in autism-associated loci?
  • What is the effect of using Roadmap Epigenomics tissue-specific tracks versus generic genomic features on model accuracy?
  • Does an Enformer embedding improve classification of regulatory variants compared with one-hot sequence features alone?
  • To what extent does fine-tuning a multilayer perceptron on labels derived from the SPARK and Simons Simplex Collection (SSC) cohorts improve separation of known risk and control variants?
  • Which epigenomic marks most strongly increase prediction confidence for variants near autism-associated genes?
  • How does model performance change when you restrict training to brain-related cell types versus all available cell types?

Basic Materials

  • Computer with internet access and enough memory for local analysis.
  • A Python environment such as Anaconda or Google Colab.
  • Python packages for data analysis, machine learning, and genomics work.
  • Public variant annotation tables from ENCODE, Roadmap Epigenomics, and ClinVar or another curated variant source.
  • Autism-associated locus lists from public studies or databases.
  • Spreadsheet software for tracking samples, labels, and model outputs.
  • Version control with Git or a simple project folder system.

Advanced Materials

  • Access to a Linux workstation or university compute cluster.
  • GPU access for embedding extraction or model training.
  • FASTA reference genome files.
  • BED files for regulatory regions, enhancers, and epigenomic peaks.
  • Chromatin track datasets from ENCODE and Roadmap Epigenomics.
  • Curated summary labels derived from SPARK and SSC, or public proxy labels if direct cohort data are not available. Individual-level data from both cohorts require an approved access request.
  • Tools for variant annotation, sequence extraction, and genomic interval operations.
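When combining BED files with variant positions, coordinate conventions are an easy place to slip. BED intervals are 0-based with an exclusive end, while VCF-style variant positions are 1-based. The sketch below shows the conversion with a deliberately simple linear scan. The function name and the toy peaks are assumptions, and a dedicated interval tool would be a better fit for genome-scale data.

```python
def overlaps_peak(chrom, pos_1based, peaks_by_chrom):
    """Return True if a 1-based variant position falls inside any BED-style peak.

    peaks_by_chrom maps a chromosome name to a list of (start, end) tuples,
    where start is 0-based and end is exclusive, as in the BED format.
    """
    pos0 = pos_1based - 1  # convert the 1-based position to 0-based
    return any(start <= pos0 < end for start, end in peaks_by_chrom.get(chrom, []))

# Toy peak set (hypothetical coordinates, not real annotations).
toy_peaks = {"chr1": [(100, 200), (500, 650)]}

inside_first_base = overlaps_peak("chr1", 101, toy_peaks)   # first base of the peak
just_before = overlaps_peak("chr1", 100, toy_peaks)         # one base upstream
inside_last_base = overlaps_peak("chr1", 200, toy_peaks)    # last base of the peak
just_after = overlaps_peak("chr1", 201, toy_peaks)          # one base downstream
```

Checking the boundary cases by hand, as in the four calls above, is a quick way to catch off-by-one errors before they leak into a feature table.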

Software & Tools

  • Python: Runs data cleaning, feature building, model training, and evaluation.
  • Jupyter Notebook: Lets you document each analysis step and keep code, notes, and figures together.
  • PyTorch: Supports building and fine-tuning the multilayer perceptron and embedding-based models.
  • scikit-learn: Helps with train-test splits, baseline models, metrics, and cross-validation.
  • UCSC Genome Browser: Lets you inspect regulatory regions and compare variants with epigenomic tracks.

Experiment Steps

  1. Define the exact prediction task, such as classifying variants as likely functional or likely neutral within autism-associated loci.
  2. Choose one clear label source and one comparison set so your model has a defensible ground truth.
  3. Build a feature table that separates sequence-only inputs from epigenomic and embedding-based inputs.
  4. Set up a baseline model first, then add ENCODE, Roadmap, and Enformer features one group at a time.
  5. Plan a validation strategy that checks for data leakage, cohort overlap, and overfitting.
  6. Decide how you will interpret results, such as feature importance, error analysis, or subgroup performance by cell type.
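Steps 3 and 4 can be sketched as a small bookkeeping routine that keeps feature groups separate and adds them one at a time. The column names and the single toy row are hypothetical placeholders. Your own feature table will have different columns and many more rows.

```python
# Hypothetical column names; replace them with the columns in your own feature table.
FEATURE_GROUPS = {
    "sequence": ["gc_content", "cpg_count"],
    "encode": ["dnase_signal", "h3k27ac_signal"],
    "roadmap": ["enhancer_state"],
    "enformer": ["emb_0", "emb_1"],
}

def cumulative_feature_sets(groups, order):
    """Yield (label, columns) pairs, adding one feature group per step."""
    columns = []
    for i, name in enumerate(order):
        columns = columns + groups[name]
        yield "+".join(order[: i + 1]), list(columns)

def select_columns(rows, columns):
    """Turn a list of dict rows into a plain matrix holding only the chosen columns."""
    return [[row[c] for c in columns] for row in rows]

order = ["sequence", "encode", "roadmap", "enformer"]
feature_sets = list(cumulative_feature_sets(FEATURE_GROUPS, order))

# One made-up row, used only to show the shapes that come out.
toy_rows = [
    {"gc_content": 0.41, "cpg_count": 3, "dnase_signal": 1.2, "h3k27ac_signal": 0.8,
     "enhancer_state": 1, "emb_0": 0.05, "emb_1": -0.30},
]
for label, columns in feature_sets:
    matrix = select_columns(toy_rows, columns)
    print(label, len(matrix[0]))
```

Training the same model on each matrix in turn, with the same splits, lets you attribute any gain to the group that was just added.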

Common Pitfalls

  • Mixing variants from the same genomic region across train and test splits, which can make accuracy look higher than it really is.
  • Using labels from mixed sources without checking how each source defined functional and non-functional variants.
  • Comparing models with different feature counts without a matched baseline, which hides whether the gain came from biology or model size.
  • Ignoring tissue specificity, which can wash out a brain-relevant signal in autism loci.
  • Treating a high probability score as proof of biological effect, which turns a prediction task into an overclaim.
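One common guard against the first pitfall is to hold out whole chromosomes, so no test variant shares a genomic neighborhood with a training variant. The sketch below assumes each variant is a dict with a `chrom` key. The toy variants and the choice of `chr8` as the held-out chromosome are arbitrary. scikit-learn's `GroupKFold` offers a similar grouped split if you prefer cross-validation.

```python
def split_by_chromosome(variants, test_chroms):
    """Split variants so that every chromosome is entirely in train or entirely in test."""
    test_chroms = set(test_chroms)
    train = [v for v in variants if v["chrom"] not in test_chroms]
    test = [v for v in variants if v["chrom"] in test_chroms]
    return train, test

# Made-up variants for illustration only.
toy_variants = [
    {"chrom": "chr1", "pos": 1000, "label": 1},
    {"chrom": "chr1", "pos": 1050, "label": 0},
    {"chrom": "chr2", "pos": 500, "label": 1},
    {"chrom": "chr8", "pos": 7200, "label": 0},
    {"chrom": "chr8", "pos": 7260, "label": 1},
]

train, test = split_by_chromosome(toy_variants, test_chroms=["chr8"])

# Sanity check: no chromosome appears on both sides of the split.
train_chroms = {v["chrom"] for v in train}
held_out_chroms = {v["chrom"] for v in test}
assert train_chroms.isdisjoint(held_out_chroms)
```

A random row-level split would likely have placed the two nearby `chr8` variants on opposite sides, which is the leakage this design avoids.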

What Makes This Competitive

A class-level version of this project might compare two models on a public dataset. A stronger version asks a sharper question, like which tissue tracks matter most for brain-linked regulatory variants, or whether one model class fails on certain genomic neighborhoods. You can raise the level by using strict held-out testing, proper negative controls, and multiple metrics instead of accuracy alone. Careful error analysis can also reveal where the model struggles, which is often more useful than a single score.
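To see why accuracy alone can mislead, the sketch below scores one set of made-up predictions three ways. The labels and scores are toy values, and the pure-Python metric functions assume no tied scores and at least one positive and one negative example. In practice scikit-learn's `roc_auc_score` and `average_precision_score` cover the same ground.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy imbalanced data: 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25]

acc = accuracy(labels, scores)                   # 0.8
baseline_acc = accuracy(labels, [0.0] * 10)      # 0.8 for "always predict negative"
roc = auroc(labels, scores)                      # 0.9375
ap = average_precision(labels, scores)           # about 0.833
```

Here the model and a trivial always-negative baseline reach the same accuracy, while the ranking metrics show that the model does carry signal. Reporting all three makes that difference visible.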

Project Variations

  • Test whether brain-specific epigenomic tracks outperform pan-tissue tracks for predicting regulatory variants in autism loci.
  • Compare Enformer embeddings against simpler sequence k-mer features to see whether deep sequence representations add real value.
  • Focus on rare non-coding variants from one cohort and ask whether model confidence changes with allele frequency or genomic distance to the nearest gene.

Learn More

  • ENCODE Project: Search the ENCODE portal for chromatin accessibility, histone marks, and regulatory annotations in human tissues.
  • Roadmap Epigenomics Project: Find tissue-specific epigenomic maps through the NIH Roadmap data resources.
  • NCBI ClinVar: Search ClinVar for curated variant interpretations and supporting evidence.
  • UCSC Genome Browser: Inspect regulatory regions, conservation tracks, and gene neighborhoods for your variants.
  • MIT OpenCourseWare, 6.036 (Introduction to Machine Learning) or a similar machine learning course: Review core ideas in model evaluation, overfitting, and feature design.
  • PubMed: Search for review articles on autism genetics, non-coding variation, and regulatory genomics.