Predict Plasmid Host Range With AI

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

Subcategory: Microbial Genetics  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

Some DNA acts like a hitchhiker. Plasmids can move between bacteria and carry traits like antibiotic resistance. If you could predict which bacteria a plasmid can live in from sequence alone, you would be studying one of the biggest questions in microbial genetics. That makes this a strong mix of biology and machine learning.

What Is It?

Plasmids are small circles of DNA that sit outside the main bacterial chromosome. Think of them like portable instruction manuals. A plasmid can carry genes that help bacteria survive stress, share nutrients, or resist antibiotics.

Plasmid host-range prediction asks a simple question with a hard answer: which bacteria can a plasmid enter and persist in? The host range is the set of bacterial genera, or broader groups, that can support that plasmid. Your model tries to read the plasmid sequence and guess the host genus, which is a sequence classification problem. Predicting one genus per plasmid is a simplification, since a broad-host-range plasmid can have several valid hosts. The challenge is that DNA carries clues in many forms, including gene content, motifs, and overall patterns, not just one obvious marker.

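In this project, you would train a model on known plasmid-host pairs from a database such as PLSDB, then test whether it can predict the host genus for new plasmids it has never seen. You can compare a fine-tuned ESM2 model against an existing method such as gPlas. ESM2 is a protein language model trained on amino acid sequences, so you would feed it the proteins encoded on each plasmid (for example, translated open reading frames) rather than raw nucleotide sequence. Before committing to a baseline, confirm that it actually outputs host predictions; gplas, for instance, was published as a tool for binning plasmid contigs from short-read assembly graphs, so it may need adapting. The goal is not just to get a high score. You want to understand when the model succeeds, when it fails, and what sequence features seem to matter most.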
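To make the classification task concrete before any fine-tuning, it can help to start with a very small baseline. The sketch below is an illustration only and is not part of PLSDB, ESM2, or gPlas: it turns each sequence into a tetranucleotide frequency profile and guesses the genus of the most similar training sequence. The function names, the toy sequences, and the labels "GenusA" and "GenusB" are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=4):
    """Return relative k-mer frequencies, skipping k-mers with non-ACGT letters."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse frequency dictionaries."""
    dot = sum(v * b.get(kmer, 0.0) for kmer, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_genus(query_seq, training_pairs, k=4):
    """Nearest-neighbor guess: the genus of the most similar training sequence."""
    query = kmer_profile(query_seq, k)
    best = max(
        training_pairs,
        key=lambda pair: cosine(query, kmer_profile(pair[0], k)),
    )
    return best[1]


# Toy placeholder data, not real plasmid sequences or real genera.
training_pairs = [
    ("ATATATATATATATATATAT", "GenusA"),
    ("GCGCGCGCGCGCGCGCGCGC", "GenusB"),
]
print(predict_genus("ATATATATTATATATA", training_pairs))
```

A baseline this simple gives a floor for comparison: if a fine-tuned model cannot beat nearest-neighbor k-mer matching, that is worth reporting and explaining.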

Why This Is a Good Topic

This is a strong science fair topic because you can measure model performance with clear numbers, and you can ask questions that matter in real microbiology. Plasmid host prediction connects to antibiotic resistance spread, bacterial ecology, and genome surveillance. You can learn real bioinformatics skills, like dataset cleaning, train-test splits, benchmarking, and error analysis, without needing to grow bacteria in a wet lab.

Research Questions

  • How does fine-tuning an ESM2 model change host-genus prediction accuracy compared with a baseline method like gPlas?
  • What is the effect of plasmid length on host-genus prediction performance?
  • Does adding gene-content features improve prediction more than using sequence alone?
  • To what extent does prediction accuracy drop for rare host genera with few training examples?
  • Which plasmid families are most often misclassified by the model?
  • How does performance change when you hold out the newest 2024-deposited plasmids for testing?
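The last question above depends on a time-based holdout. A minimal sketch of that split follows, assuming each record carries a deposit date. The field names `id`, `genus`, and `deposited` are made up for illustration and are not actual PLSDB column names.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records into (train, test); records on or after `cutoff` go to test."""
    train = [r for r in records if r["deposited"] < cutoff]
    test = [r for r in records if r["deposited"] >= cutoff]
    return train, test


def unseen_genera(train, test):
    """Genera that appear in the test set but never in training."""
    return {r["genus"] for r in test} - {r["genus"] for r in train}


# Hypothetical records; real metadata would be loaded from the database files.
records = [
    {"id": "p1", "genus": "Escherichia", "deposited": date(2021, 5, 3)},
    {"id": "p2", "genus": "Klebsiella", "deposited": date(2023, 11, 20)},
    {"id": "p3", "genus": "Escherichia", "deposited": date(2024, 2, 14)},
]
train, test = temporal_split(records, cutoff=date(2024, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])
print(unseen_genera(train, test))
```

Checking for unseen genera matters because a classifier can never predict a label it was not trained on, so those test plasmids should be counted and reported separately.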

Basic Materials

  • Laptop or desktop computer with enough storage for sequence files and model outputs.
  • Python installed through Anaconda or Miniconda.
  • Jupyter Notebook or VS Code for analysis.
  • Internet access for downloading public plasmid databases and documentation.
  • PLSDB sequence and metadata files from the public database.
  • A spreadsheet program such as Google Sheets or LibreOffice Calc for tracking samples and results.
  • Basic reference set of known bacterial taxonomy labels from NCBI Taxonomy.

Advanced Materials

  • University or lab workstation with a modern GPU for model fine-tuning.
  • Python environment with PyTorch, Hugging Face Transformers, scikit-learn, pandas, NumPy, and Biopython.
  • Access to a high-performance storage drive for large sequence datasets.
  • Public plasmid-host benchmark data from PLSDB and related metadata tables.
  • NCBI Taxonomy database files for genus-level label cleaning.
  • gPlas outputs or source code for method comparison.
  • Software for sequence alignment or clustering, if you plan to test similarity leakage.

Software & Tools

  • Python: Runs data cleaning, model training, and evaluation scripts for the plasmid dataset.
  • Jupyter Notebook: Helps you explore sequences, labels, and model errors step by step.
  • Hugging Face Transformers: Provides ESM2 fine-tuning tools and pretrained sequence models.
  • scikit-learn: Calculates classification metrics, confusion matrices, and baseline models.
  • Biopython: Reads sequence files and parses FASTA and GenBank records; pandas is the better fit for the metadata tables.

Experiment Steps

  1. Define the exact prediction task, including host-genus labels, sequence filters, and what counts as a valid test case.
  2. Build a clean dataset from public plasmid records, then remove duplicate or near-duplicate sequences that could leak into testing.
  3. Choose a baseline model and a sequence model, then decide how you will keep training and evaluation splits fair.
  4. Plan a benchmark that compares overall accuracy, class imbalance effects, and performance on rare genera.
  5. Design a holdout test using newer plasmids, so you can measure whether the model generalizes beyond older database entries.
  6. Prepare an error-analysis plan that groups mistakes by plasmid length, host frequency, and plasmid family.

Common Pitfalls

  • Using random train-test splits without removing close sequence duplicates, which makes the model look better than it really is.
  • Mixing genus labels from different taxonomy sources, which creates hidden label noise.
  • Training on the same plasmid families that appear in the test set, which inflates host prediction scores.
  • Reporting only accuracy, which hides poor performance on rare host genera.
  • Skipping error analysis, which leaves you unable to explain why the model confuses certain hosts.
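The accuracy pitfall is easy to demonstrate with a few lines of code. The sketch below uses invented labels and hand-written metric functions (scikit-learn offers equivalent metrics) to compare overall accuracy with recall averaged across genera for a model that always guesses the most common genus.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true genus, the fraction of its plasmids predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-genus recall, so rare genera count equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy labels: nine plasmids from a common genus, one from a rare genus.
y_true = ["Escherichia"] * 9 + ["Shewanella"]
y_pred = ["Escherichia"] * 10  # a model that always guesses the common genus

print(accuracy(y_true, y_pred))          # 0.9
print(macro_recall(y_true, y_pred))      # 0.5
print(per_class_recall(y_true, y_pred))
```

The 0.9 accuracy looks respectable, while the 0.5 macro recall reveals that the rare genus is never predicted correctly.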

What Makes This Competitive

A strong version of this project goes past a basic accuracy score. You would compare multiple models, control for duplicate sequences, and test whether the model still works on newer plasmids it never saw during training. You would also look at failure cases, especially rare hosts and closely related genera. That kind of analysis shows real understanding, not just a working script.
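Looking at failure cases usually means grouping predictions and comparing error rates between groups. Here is a small sketch of that idea, assuming each result row stores the true genus, the predicted genus, and the plasmid length. The row fields, the "GenusA" and "GenusB" labels, and the 10 kb and 100 kb length cutoffs are illustrative assumptions, not established thresholds.

```python
from collections import defaultdict


def length_bin(length_bp, edges=(10_000, 100_000)):
    """Label a plasmid as small, medium, or large using arbitrary cutoffs."""
    if length_bp < edges[0]:
        return "small"
    if length_bp < edges[1]:
        return "medium"
    return "large"


def error_rate_by_group(results, key):
    """Error rate per group, where `key` maps a result row to its group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in results:
        group = key(row)
        totals[group] += 1
        errors[group] += int(row["true"] != row["pred"])
    return {group: errors[group] / totals[group] for group in totals}


# Invented results for illustration only.
results = [
    {"length": 5_000, "true": "GenusA", "pred": "GenusA"},
    {"length": 8_000, "true": "GenusA", "pred": "GenusB"},
    {"length": 50_000, "true": "GenusB", "pred": "GenusB"},
    {"length": 200_000, "true": "GenusB", "pred": "GenusA"},
]
by_length = error_rate_by_group(results, key=lambda r: length_bin(r["length"]))
by_genus = error_rate_by_group(results, key=lambda r: r["true"])
print(by_length)
print(by_genus)
```

The same function works for any grouping, such as plasmid family or host frequency, by swapping the `key` function.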

Project Variations

  • Try the same host prediction task on plasmids from a single bacterial family, then compare whether narrower taxonomy improves accuracy.
  • Swap genus prediction for plasmid mobility prediction, and test whether the same sequence features help.
  • Compare sequence-only models against models that add gene annotations, then measure which feature set helps most on rare hosts.

Learn More

  • NCBI Taxonomy Database: Use this to check and clean genus labels for your plasmid-host pairs. Search the NCBI site for taxonomy resources.
  • PLSDB: A public plasmid database with sequence and metadata records. Search for PLSDB and its download page.
  • PubMed: Search for review articles on plasmid host range, plasmid ecology, and sequence-based host prediction.
  • Genome Research: Search this journal for studies on plasmid classification, microbial genomics, and sequence models.
  • MIT OpenCourseWare, Introduction to Computational Biology: Use this for free background on sequence data analysis and biological modeling. Search the MIT OpenCourseWare site.
  • NCBI Bookshelf: Find free textbook chapters on bacterial genetics, plasmids, and microbial evolution. Search the NCBI Bookshelf site.