Predicting E. coli Promoter Strength With AI

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Microbial Genetics  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A few DNA letters can change gene output a lot. Think of a promoter like a volume knob for a cell. Some turn genes up, some keep them low, and some barely work at all. If you can predict that knob from sequence alone, you are doing real genetic modeling.

What Is It?

A promoter is a short DNA region that tells a cell where to start copying a gene into RNA. In simple terms, it acts like a control switch. Strong promoters usually make more RNA, which can lead to more protein, while weak promoters make less.

Your project asks a machine learning model to read promoter DNA and guess how strong each promoter is. A transformer model looks for patterns in sequence, much like it learns patterns in language. Here, the “words” are DNA bases, and the output is a predicted expression level.
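
To make the "words are DNA bases" idea concrete, here is a minimal sketch of turning a promoter string into integer tokens a sequence model could read. The vocabulary, the `tokenize` name, and the padding scheme are illustrative assumptions, not part of any specific library or of this project's required method.

```python
# Hypothetical vocabulary: the index choices here are illustrative only.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}


def tokenize(seq, max_len):
    """Map a DNA string to a fixed-length list of integer token IDs."""
    cleaned = seq.strip().upper()
    # Unknown characters fall back to the "N" token instead of failing.
    ids = [VOCAB.get(base, VOCAB["N"]) for base in cleaned[:max_len]]
    # Right-pad short sequences so every example has the same length.
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))
    return ids


print(tokenize("ttgacagc", max_len=10))  # [4, 4, 3, 1, 2, 1, 3, 2, 0, 0]
```

Each integer would then be looked up in an embedding table inside the model. One token per base is only one option; you could also try short overlapping k-mers as tokens and compare the results.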

You can test the model on held-out synthetic promoters, which means sequences it never saw during training. Then you can compare its predictions to measured GFP fluorescence from engineered E. coli strains. GFP, or green fluorescent protein, glows when the gene is expressed, so fluorescence gives you a readout for promoter strength.
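
One hedged way to compare predictions with measured fluorescence is a rank correlation, which asks whether the model orders promoters from weak to strong the same way the lab data does. The sketch below computes Spearman's correlation with the standard library only. The function names and the example numbers are made up for illustration and are not real measurements.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; fails with ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Made-up values for four held-out promoters, for illustration only.
predicted = [0.10, 0.40, 0.35, 0.80]
measured = [120, 300, 310, 900]
print(round(spearman(predicted, measured), 3))  # 0.8
```

A rank-based metric is a reasonable first choice here because model outputs and fluorescence values are on different scales. If you have SciPy available, its own Spearman function would be a more thoroughly tested option than this sketch.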

Why This Is a Good Topic

This makes a strong science fair topic because you can measure a clear input and a clear output: DNA sequence and gene expression. You also get a real biology question that connects to synthetic biology, genetic engineering, and gene regulation. Along the way, you can learn data cleaning, model training, validation, and how to compare predictions with real lab measurements.

Research Questions

  • How does a small transformer model perform when predicting promoter strength from synthetic E. coli sequences?
  • What is the effect of adding reverse-complement augmentation on promoter prediction accuracy?
  • Does training on the Anderson promoter library improve prediction for held-out synthetic promoters?
  • To what extent do predicted promoter strengths match smartphone-based GFP fluorescence measurements in pGLO derivatives?
  • Which sequence features, such as motif position or GC content, most influence model predictions?
  • How does a transformer compare with a simple linear or random forest model on the same promoter dataset?
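
Two of the questions above mention reverse-complement augmentation and GC content. The following is a minimal sketch of both, written with the standard library. The function names and the `(sequence, strength)` record layout are assumptions for illustration. Biopython offers its own reverse-complement tools if you prefer a library.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def augment(records):
    """Add a reverse-complement copy of each (sequence, strength) record.

    Assumption: the copy keeps the same strength label. Whether that
    assumption helps or hurts is exactly what the experiment would test.
    """
    out = []
    for seq, strength in records:
        out.append((seq.upper(), strength))
        out.append((reverse_complement(seq), strength))
    return out


print(reverse_complement("TTGACA"))  # TGTCAA
print(round(gc_content("TTGACA"), 3))  # 0.333
```

If you try augmentation, apply it to the training set only, after you have split off the held-out promoters, so that a test sequence never appears in training in reversed form.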

Basic Materials

  • Laptop with Python support.
  • Public promoter sequence dataset, such as the Anderson promoter collection in the Registry of Standard Biological Parts.
  • Open-source machine learning library such as PyTorch or TensorFlow.
  • Spreadsheet software for tracking samples and results.
  • Smartphone with manual camera controls.
  • Blue or UV excitation source for fluorescence imaging.
  • Green fluorescent standard or reference sample.
  • Dark box or enclosure for consistent smartphone imaging.
  • Image analysis software such as ImageJ.
  • Basic lab notebook for recording sequence labels and fluorescence values.

Advanced Materials

  • University access to a plate reader or fluorescence microscope.
  • Engineered E. coli strains with pGLO-derived plasmids or similar GFP reporter constructs.
  • Molecular biology reagents for plasmid prep and verification.
  • DNA synthesis or cloning access for building synthetic promoter variants.
  • PCR and gel electrophoresis equipment.
  • Sterile culture materials and incubator access.
  • Fluorescence calibration standard.
  • High-resolution imaging setup for comparing reporter output.
  • Access to a GPU workstation for model training.
  • Version-controlled data storage for sequence and phenotype records.

Software & Tools

  • Python: Manages sequence processing, model training, and analysis.
  • PyTorch: Builds and trains the transformer model.
  • scikit-learn: Provides baseline models and evaluation metrics.
  • Biopython: Handles DNA sequence cleaning, encoding, and reverse complements.
  • ImageJ: Measures fluorescence intensity from smartphone images.

Experiment Steps

  1. Define the prediction task by choosing which promoter sequences count as training data and which count as held-out test data.
  2. Prepare the DNA sequences in a consistent format, then decide how you will encode them for the model.
  3. Build a simple baseline model first, so you have something fair to beat.
  4. Decide how you will check for overfitting, with cross-validation or a separate test set, then train the transformer.
  5. Design a fluorescence validation plan that matches each engineered promoter to a comparable GFP readout.
  6. Set up an analysis plan that compares predicted strength, measured fluorescence, and model uncertainty.
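
The baseline and cross-validation steps above can be sketched in a few lines. This example uses the weakest possible baseline, which always predicts the mean training strength, and scores it with k-fold cross-validation. All names here are hypothetical, and the strength values are invented for illustration. A linear or random forest model from scikit-learn would be a stronger baseline to add next.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists."""
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5


def mean_baseline_cv(strengths, k=5, seed=0):
    """Average RMSE of a predictor that always guesses the training mean."""
    scores = []
    for train, test in k_fold_indices(len(strengths), k, seed):
        mean_strength = sum(strengths[i] for i in train) / len(train)
        predictions = [mean_strength] * len(test)
        scores.append(rmse(predictions, [strengths[i] for i in test]))
    return sum(scores) / len(scores)


# Invented relative strengths for ten promoters, for illustration only.
strengths = [0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.70, 0.85, 1.00]
print(mean_baseline_cv(strengths, k=5))
```

Any model worth reporting should beat this number on the same folds. Fixing the seed keeps the folds identical across models, so the comparison stays fair.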

Common Pitfalls

  • Mixing promoter names, sequence IDs, and sample labels, which breaks the link between training data and validation data.
  • Training and testing on nearly identical promoter variants, which makes accuracy look better than it really is.
  • Comparing fluorescence images taken under different camera settings, which changes intensity values without changing biology.
  • Ignoring sequence direction. Promoters work in one orientation, so decide up front whether the model sees forward sequences only or reverse complements as well, and apply that choice consistently.
  • Treating raw fluorescence as promoter strength without correcting for background signal, culture differences, or image exposure.
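
As a rough sketch of the correction described in the last pitfall, the function below subtracts background, scales by a reference standard imaged under the same settings, and optionally divides by culture density. The function name, the argument names, and the example numbers are assumptions for illustration. It also assumes the camera responds linearly to light, which may not hold for processed smartphone images and is worth checking with your reference standard.

```python
def relative_fluorescence(sample, background, reference, od600=None):
    """Background-subtract a reading and scale it to a reference standard.

    sample, background, reference: mean intensities (for example, from
    regions of interest measured in ImageJ) taken with identical settings.
    od600: optional culture density used to correct for cell number.
    """
    if reference <= background:
        raise ValueError("reference must be brighter than background")
    # Clamp at zero so noise below background does not go negative.
    signal = max(sample - background, 0.0)
    scaled = signal / (reference - background)
    if od600 is not None:
        if od600 <= 0:
            raise ValueError("od600 must be positive")
        scaled /= od600
    return scaled


# Made-up intensities, for illustration only.
print(relative_fluorescence(sample=140, background=20, reference=260))  # 0.5
```

Reporting values relative to the same standard lets you compare images from different days, as long as the camera settings and the standard itself stay fixed.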

What Makes This Competitive

A strong version of this project does more than train one model and report accuracy. You can compare several model types, test whether sequence features explain the predictions, and use a clean held-out set that prevents data leakage. The best entries also connect computation to biology by checking whether predicted promoter strength matches real reporter output in a thoughtful validation design. That kind of analysis shows you understand both the model and the gene regulation behind it.
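
One simple way to check a held-out set for leakage is to look for test promoters that differ from a training promoter by only a few bases. The sketch below flags such pairs using Hamming distance. The function names, the `max_mismatches` threshold, and the toy sequences are all assumptions for illustration, and you would need to choose a threshold that fits your own library.

```python
def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def leaky_pairs(train_seqs, test_seqs, max_mismatches=2):
    """Return (test_seq, train_seq, distance) for near-duplicates across the split.

    Only sequences of equal length are compared; others are skipped.
    """
    flagged = []
    for test_seq in test_seqs:
        for train_seq in train_seqs:
            if len(train_seq) != len(test_seq):
                continue
            distance = hamming(train_seq, test_seq)
            if distance <= max_mismatches:
                flagged.append((test_seq, train_seq, distance))
    return flagged


# Toy sequences, for illustration only.
train = ["TTGACAGC", "AAAAAAAA"]
test = ["TTGACAGG", "CCCCCCCC"]
print(leaky_pairs(train, test))  # [('TTGACAGG', 'TTGACAGC', 1)]
```

If many pairs are flagged, one option is to group similar promoters together and keep each whole group on one side of the split. In a library built from point mutations of a single parent, almost every pair may be flagged, so report the threshold you used and explain why.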

Project Variations

  • Try the same prediction setup on constitutive promoters from another bacterial library and compare which sequence patterns transfer across datasets.
  • Replace the transformer with a convolutional neural network and ask whether a simpler architecture performs just as well on short promoter sequences.
  • Test whether promoter strength predictions improve when you add known motif features, such as -10 and -35 region scores, to the sequence model.
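
For the motif-feature variation, a crude starting point is to count how many bases match the textbook E. coli sigma-70 consensus sequences, TTGACA for the -35 region and TATAAT for the -10 region. The sketch below is a heuristic under stated assumptions, not an established scoring method. The function names and the toy sequence are invented, and each motif is searched independently across the whole sequence.

```python
MINUS_35 = "TTGACA"  # textbook sigma-70 consensus for the -35 region
MINUS_10 = "TATAAT"  # textbook sigma-70 consensus for the -10 region


def best_match(seq, motif):
    """Return (score, start) for the window with the most matching bases.

    The earliest window wins ties. Returns (-1, -1) if seq is shorter
    than the motif.
    """
    seq = seq.upper()
    best_score, best_start = -1, -1
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


def motif_features(seq):
    """Build simple features: match counts and the gap between best hits."""
    score_35, start_35 = best_match(seq, MINUS_35)
    score_10, start_10 = best_match(seq, MINUS_10)
    # Gap between the end of the -35 hit and the start of the -10 hit.
    spacer = start_10 - (start_35 + len(MINUS_35))
    return {"minus35_score": score_35, "minus10_score": score_10, "spacer_len": spacer}


# Toy promoter: perfect motifs separated by a 17-base spacer.
toy = "TTGACA" + "GC" * 8 + "G" + "TATAAT"
print(motif_features(toy))  # {'minus35_score': 6, 'minus10_score': 6, 'spacer_len': 17}
```

Because the two searches are independent, the spacer value can be negative or meaningless when the best hits land in the wrong order. A more careful version would restrict each search to the expected region or score the two motifs jointly.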

Learn More

  • Registry of Standard Biological Parts: Search the Anderson promoter collection and related part pages for sequence data and reported strengths.
  • PubMed: Search for review articles on promoter engineering, bacterial gene regulation, and sequence-based expression prediction.
  • NIH National Center for Biotechnology Information: Use NCBI Gene and nucleotide records to explore promoter annotations and DNA sequence context.
  • MIT OpenCourseWare: Look for free courses in machine learning, genomics, or computational biology to review model basics.
  • Bioinformatics journal: Search recent papers on promoter prediction and DNA sequence modeling for methods ideas.
