Predicting Promoter Strength in Crops

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genomics · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A promoter is like a gene’s volume knob. Turn it up, and a plant can make more of a protein. Turn it down, and expression drops. With modern DNA models, you can test whether that effect can be estimated from sequence alone, even for crops that have barely been studied.

What Is It?

Promoters are short DNA regions that help control when a gene turns on and how strongly it turns on. Think of them like a dimmer switch for a light. The DNA sequence inside a promoter affects how well cellular machinery can start copying the nearby gene into RNA.

Foundation models such as Nucleotide-Transformer and Evo are large machine learning models trained on huge amounts of DNA. They learn patterns in sequence the way a language model learns grammar. In this project, you ask whether those models can predict promoter strength in non-model crops like millet and teff, even before any new wet-lab data exist. This setup, where a model is scored without being trained on promoter activity labels for the target crops, is called zero-shot prediction. You then compare those predictions with public MPRA datasets. MPRA stands for massively parallel reporter assay, a method that tests many DNA sequences at once and measures how strongly each one drives expression.
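
To make "how strongly each one drives expression" concrete, here is a minimal sketch of one common way MPRA activity is summarized: the log2 ratio of RNA reads to DNA reads for each tested sequence. The function name `mpra_activity`, the pseudocount, the counts-per-million scaling, and the read counts are all illustrative assumptions. Each public dataset documents its own normalization, so check the original paper before reusing this.

```python
import math

def mpra_activity(rna_counts, dna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) per sequence after scaling to counts per million.

    rna_counts and dna_counts map a sequence ID to its read count.
    The pseudocount keeps sequences with zero reads from causing log(0).
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    activity = {}
    for seq_id, dna in dna_counts.items():
        rna = rna_counts.get(seq_id, 0)
        rna_cpm = (rna + pseudocount) / rna_total * 1e6
        dna_cpm = (dna + pseudocount) / dna_total * 1e6
        activity[seq_id] = math.log2(rna_cpm / dna_cpm)
    return activity

# Made-up read counts for three hypothetical promoter fragments.
rna = {"prom_A": 400, "prom_B": 100, "prom_C": 25}
dna = {"prom_A": 100, "prom_B": 100, "prom_C": 100}
scores = mpra_activity(rna, dna)
for seq_id in sorted(scores, key=scores.get, reverse=True):
    print(seq_id, round(scores[seq_id], 2))
```

A higher value means more RNA was produced per copy of the DNA construct, which is the "promoter strength" your model predictions get compared against.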

Why This Is a Good Topic

This is a strong science fair topic because the question is testable with public data, clear metrics, and real biology behind it. You can compare model predictions against known promoter measurements, test whether the model works better on some sequence types than others, and ask whether crop-specific DNA behaves differently from training data. The project connects to crop improvement, gene regulation, and synthetic biology, but you can still do the core analysis from a laptop if you have the right data and coding skills.

Research Questions

How does model choice affect zero-shot promoter strength prediction in non-model crops?
What is the effect of promoter GC content on prediction error in millet and teff?
Does performance change when you compare crop promoters with promoters from model plant species?
To what extent do Nucleotide-Transformer and Evo agree on the strongest promoter sequences?
Which promoter features best explain where the models fail on public MPRA datasets?
How does transfer performance change when you train on one public MPRA dataset and test on another?
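
The GC content question above can be prototyped in a few lines. The sketch below computes GC content and then averages the absolute prediction error within GC bins. The bin edges, the function names, and the example records are assumptions for illustration, and it assumes predictions and measurements have already been put on the same scale (for example, both standardized).

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def error_by_gc_bin(records, edges=(0.0, 0.4, 0.6, 1.0)):
    """Mean absolute error per GC bin.

    records is an iterable of (sequence, predicted, measured) tuples.
    Bins are [low, high), except the last bin, which also includes 1.0.
    """
    bins = {}
    for seq, predicted, measured in records:
        gc = gc_content(seq)
        for low, high in zip(edges, edges[1:]):
            is_last = high == edges[-1]
            if low <= gc < high or (is_last and gc == high):
                bins.setdefault((low, high), []).append(abs(predicted - measured))
                break
    return {b: sum(errs) / len(errs) for b, errs in bins.items()}

# Made-up sequences and scores, for illustration only.
records = [
    ("ATATATAT", 0.5, 1.0),
    ("GCGCGCGC", 2.0, 1.0),
    ("ATGCATGC", 1.0, 1.25),
    ("AATTAATG", 0.0, 0.5),
]
print(error_by_gc_bin(records))
```

If one bin shows a much larger error than the others, that is a lead worth following up with more sequences, not a conclusion on its own.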

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Stable internet access for downloading public datasets and models.
Python installed through Anaconda or Miniconda.
Jupyter Notebook or JupyterLab for analysis.
Spreadsheet software for tracking datasets and results.
Public MPRA datasets from journals or repositories.
Reference genome or promoter sequence files for millet and teff, if available.
Basic code editor such as VS Code.

Advanced Materials

Access to a GPU workstation or university compute cluster.
Python environment with PyTorch, Hugging Face Transformers, NumPy, pandas, SciPy, scikit-learn, and matplotlib.
Sequence alignment tools for checking promoter annotations.
R or Bioconductor for additional statistics and visualization.
Genome annotation files for millet, teff, and related grasses.
Curated benchmark set of public MPRA promoter measurements.
Version control with Git for reproducible model comparisons.

Software & Tools

Python: Runs data cleaning, model scoring, and statistical analysis for promoter sequences.
Jupyter Notebook: Lets you document code, plots, and results in one place.
Hugging Face Transformers: Provides access to pretrained DNA foundation models for sequence scoring.
scikit-learn: Helps you calculate metrics like prediction error and classification performance, and pairs with SciPy for correlation coefficients.
ImageJ: Optional. Can help inspect exported figure images for resolution or rendering problems before you submit.

Experiment Steps

Define the exact prediction task, such as ranking promoter strength or estimating a continuous activity score.
Collect a benchmark set of public MPRA sequences, then decide how you will split data to avoid leakage.
Choose a baseline method, then compare it with one or more foundation models on the same test set.
Build a scoring plan that turns model outputs into a standard metric such as correlation or mean error.
Add sequence feature checks, such as GC content or motif presence, so you can explain model behavior.
Plan one cross-dataset or cross-species test to see whether the model generalizes beyond the data it saw during training.
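
The scoring step above can be sketched with a rank correlation. Spearman correlation only asks whether the model orders promoters correctly, so it works even when a zero-shot score is not on the same scale as the MPRA measurement. The version below is a plain-Python sketch with made-up numbers; `model_scores` and `baseline_scores` are hypothetical stand-ins for a foundation model and a simple baseline. In a real analysis you would likely use `scipy.stats.spearmanr` instead.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values for five test promoters.
measured = [0.2, 1.5, 0.9, 2.4, 1.1]
model_scores = [0.1, 1.2, 1.0, 3.0, 0.8]
baseline_scores = [0.5, 0.4, 0.3, 0.2, 0.1]

print("model:", round(spearman(measured, model_scores), 3))
print("baseline:", round(spearman(measured, baseline_scores), 3))
```

Scoring the baseline and the foundation model with the same function on the same test promoters keeps the comparison fair.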

Common Pitfalls

Mixing promoters from different assay types, which makes model scores hard to compare.
Using overlapping train and test sequences, which inflates performance through data leakage.
Treating one public dataset as if it represents all plant promoters, which hides generalization problems.
Ignoring sequence length differences, which can bias the model toward easier examples.
Reporting only one metric, which can miss failures on weak or mid-strength promoters.
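
One simple guard against the leakage pitfall above is to assign each sequence to train or test based on a hash of the sequence itself, so identical sequences can never land on both sides. This is a minimal sketch: `split_bucket`, `split_records`, and the example records are hypothetical, and it only catches exact duplicates. Near-identical or homologous promoters would need a similarity-based grouping step, which is not shown here.

```python
import hashlib

def split_bucket(seq, test_fraction=0.2):
    """Deterministically map a sequence to 'train' or 'test'.

    The same sequence (ignoring case) always gets the same bucket,
    no matter which dataset or file it came from.
    """
    digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    fraction = int(digest[:8], 16) / 16 ** 8
    return "test" if fraction < test_fraction else "train"

def split_records(records, test_fraction=0.2):
    """Split (sequence_id, sequence) pairs into train and test lists."""
    train, test = [], []
    for seq_id, seq in records:
        if split_bucket(seq, test_fraction) == "test":
            test.append((seq_id, seq))
        else:
            train.append((seq_id, seq))
    return train, test

# Made-up records; p1 and p2 are the same sequence in different case.
records = [
    ("p1", "ACGTACGTAC"),
    ("p2", "acgtacgtac"),
    ("p3", "TTTTGGGGAA"),
    ("p4", "GGGGCCCCTT"),
]
train, test = split_records(records)
overlap = {s.upper() for _, s in train} & {s.upper() for _, s in test}
print(len(train), len(test), "shared sequences:", len(overlap))
```

Because the assignment depends only on the sequence, rerunning the split after adding a new dataset does not reshuffle promoters you have already evaluated.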

What Makes This Competitive

A strong version of this project does more than report one accuracy number. You can test multiple models, compare them against simple baselines, and explain where each one fails. You can also ask a harder question, like whether models trained on generic plant DNA still work on underrepresented crops. Clear splits, careful statistics, and a thoughtful error analysis can make the project feel much more like real research.
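
As one example of careful statistics, a paired bootstrap can show whether the gap between two models is larger than what resampling noise would produce. The sketch below resamples test promoters with replacement and reports a 95% percentile interval for the difference in a metric. The function names, the choice of mean absolute error, and the numbers are assumptions for illustration; any metric function with the same two-argument shape could be passed in.

```python
import random

def mean_abs_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)

def paired_bootstrap_ci(measured, pred_a, pred_b, metric, n_boot=1000, seed=0):
    """95% percentile interval for metric(model A) minus metric(model B).

    The same resampled promoters are used for both models in each round,
    which is what makes the comparison paired.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(measured)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        m = [measured[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(metric(m, a) - metric(m, b))
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[max(int(0.975 * n_boot) - 1, 0)]
    return lower, upper

# Made-up values for six test promoters and two hypothetical models.
measured = [0.2, 1.5, 0.9, 2.4, 1.1, 0.7]
pred_a = [0.3, 1.4, 1.0, 2.0, 1.2, 0.6]
pred_b = [1.0, 0.5, 1.8, 1.2, 0.2, 1.6]

lower, upper = paired_bootstrap_ci(measured, pred_a, pred_b, mean_abs_error)
print("95% interval for error(A) - error(B):", round(lower, 3), round(upper, 3))
```

If the whole interval sits below zero, model A has lower error than model B across the resamples. With only a handful of promoters the interval is not trustworthy, so treat this as a template for your full test set.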

Project Variations

Test whether promoter prediction works better for grasses than for more distant plant species.
Compare sequence-only models with models that also use motif or k-mer features.
Analyze whether short promoter fragments or full promoter regions produce more accurate predictions.
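
For the k-mer variation above, a sequence can be turned into a fixed-length feature vector by counting every possible k-mer. This is a minimal sketch: the function name `kmer_features`, the default `k=3`, and the choice to normalize by the number of valid windows are assumptions, not a prescribed method.

```python
from itertools import product

def kmer_features(seq, k=3):
    """Return k-mer frequencies in a fixed order (AAA, AAC, ..., TTT for k=3).

    Windows containing anything other than A, C, G, or T are skipped.
    Frequencies are normalized so sequences of different lengths are comparable.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            n_windows += 1
    if n_windows == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / n_windows for kmer in kmers]

# Made-up sequence; with k=2 there are 4**2 = 16 features.
features = kmer_features("ACGTACGT", k=2)
print(len(features), round(sum(features), 3))
```

These vectors can feed a simple regression baseline, or be appended to foundation-model scores to test whether explicit k-mer features add anything.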

Learn More

PubMed: Search for review articles on plant promoter architecture, MPRA methods, and DNA foundation models.
NIH PubMed Central: Read full-text open-access papers on promoter prediction and plant genomics.
NCBI Gene and Nucleotide databases: Find annotated sequences and gene context for plant promoter regions.
Google Scholar: Search for recent papers on Nucleotide-Transformer, Evo, and promoter activity prediction.
MIT OpenCourseWare: Look for free courses on machine learning, genomics, or computational biology to support your analysis.
Nature Methods and Genome Biology: Search these journals for primary papers on MPRA and sequence-to-function modeling.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
