Mycobacterium Genome Classification with ML Models

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Bacteriology · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A bacterial genome runs to millions of DNA letters, yet machine learning can still pick out the pattern that marks its genus. That is the challenge here. You will teach a model to tell Mycobacterium apart from other bacteria using public whole-genome data. Then you will test whether simple k-mer features beat a deeper neural network.

What Is It?

This project asks a simple question with a smart twist: can a computer recognize bacteria from their DNA alone? You will use public whole-genome sequencing, or WGS, data. WGS gives you nearly all the DNA from an organism, like getting a full book instead of a few torn pages.

One way to read that book is with k-mers, which are short DNA chunks of fixed length. If certain chunks show up often, the genome leaves a signature. Another way is to turn read alignments into pileup images, which are visual summaries of how sequencing reads stack on a reference genome. A logistic regression model uses the k-mer counts as a baseline. A CNN, or convolutional neural network, looks for image patterns in the pileups. Your job is to compare the two and see which one classifies Mycobacterium better, especially M. smegmatis and other non-tuberculous, BSL-1 relatives.
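To make the k-mer idea concrete, here is a minimal sketch of counting k-mers in a DNA string. It assumes you count canonical k-mers, meaning each chunk is merged with its reverse complement so that both strands of the genome give the same signature. That choice, and the function names, are illustrative assumptions and not a required part of the project.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int) -> Counter:
    """Count canonical k-mers, skipping windows with N or other ambiguity codes."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # ambiguous base in this window
        # Keep whichever of the k-mer and its reverse complement sorts first.
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def kmer_frequencies(counts: Counter) -> dict:
    """Convert raw counts to relative frequencies so genome size matters less."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


if __name__ == "__main__":
    counts = canonical_kmer_counts("ACGTN", 2)
    print(counts)                    # AC appears twice (AC and GT merge), CG once
    print(kmer_frequencies(counts))
```

The frequency table for each genome becomes one row of the feature matrix that the logistic regression baseline would train on.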

Why This Is a Good Topic

This is a strong science fair topic because it gives you a clear yes-or-no prediction, real public data, and room for serious analysis. You can test whether feature-based methods or image-based methods work better, and you can measure performance with accuracy, precision, recall, and confusion matrices. The topic also connects to real microbiology, since fast species identification matters for environmental studies, clinical screening, and sequencing workflows. You can learn bioinformatics, machine learning, and model evaluation without culturing dangerous organisms.
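The metrics named above all come from the same four confusion-matrix counts. The sketch below computes them in plain Python for a binary task, assuming label 1 means Mycobacterium and label 0 means other bacteria (a hypothetical encoding). In a real analysis you would likely use the scikit-learn metrics functions instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def summarize(counts):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels: 5 Mycobacterium genomes among 100, and a model that
    # always predicts "other".
    y_true = [1] * 5 + [0] * 95
    y_pred = [0] * 100
    print(summarize(confusion_counts(y_true, y_pred)))
```

With these made-up labels the always-negative model reaches 95% accuracy while its recall is zero, which is why accuracy alone is not enough for this project.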

Research Questions

How does k-mer length affect classification accuracy for identifying Mycobacterium from public WGS data?
How does logistic regression on k-mer counts compare with a CNN on read-pileup images for genus-level classification?
Does adding more non-Mycobacterium background species improve false-positive control?
To what extent does class imbalance change precision and recall for M. smegmatis detection?
Which genomic representation, k-mer counts or pileup images, gives better separation between environmental Mycobacterium and other bacteria?
How does training on one public dataset and testing on a separate dataset affect model generalization?
What is the effect of reference genome choice on pileup-image classification performance?
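For the k-mer length question, it helps to see how fast the number of possible features grows with k. The sketch below counts all possible k-mers and, as an assumption, also the number of canonical k-mers you get when each k-mer is merged with its reverse complement. The function names are illustrative.

```python
def kmer_space_size(k: int) -> int:
    """Number of distinct DNA k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def canonical_kmer_space_size(k: int) -> int:
    """Number of k-mers after merging each one with its reverse complement.

    Only even-length k-mers can equal their own reverse complement,
    and there are 4 ** (k // 2) of those.
    """
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + self_complementary) // 2


if __name__ == "__main__":
    for k in range(2, 9):
        print(k, kmer_space_size(k), canonical_kmer_space_size(k))
```

Because the feature count grows roughly fourfold with each step in k, longer k-mers give a finer signature but also a sparser table with more columns than genomes, so the length you pick is a real experimental variable.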

Basic Materials

A laptop or desktop computer with at least 16 GB RAM.
Public WGS access from NCBI SRA or ENA.
NCBI Assembly and RefSeq records for label checking.
Python installed with a scientific package manager.
Pandas for table handling.
NumPy for array work.
scikit-learn for logistic regression and evaluation.
Jupyter Notebook or JupyterLab for analysis notes.
FASTA and FASTQ readers such as Biopython or pysam.
Enough local storage for sequencing files and derived feature matrices.

Advanced Materials

A workstation with a modern GPU for CNN training.
Public WGS reads from NCBI SRA or ENA, plus RefSeq assemblies with curated metadata.
A reference genome for M. smegmatis and close relatives.
Bowtie2 or BWA for read alignment before pileup generation.
samtools and bcftools for alignment processing and variant inspection.
PyTorch or TensorFlow for CNN development.
ImageJ or Python imaging tools for checking pileup-image quality.
High-performance storage for multiple train-test splits.
A reproducible workflow system such as Snakemake or Nextflow.
Version control with Git for tracking model changes.

Software & Tools

Python: Runs feature extraction, model training, and evaluation scripts.
scikit-learn: Fits the logistic-regression baseline and reports classification metrics.
PyTorch: Builds and trains the CNN on pileup-image inputs.
Biopython: Parses FASTA, FASTQ, and annotation files during data prep.
samtools: Processes alignments before you convert them into pileup-based features.
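To show what a pileup-based feature looks like, here is a simplified sketch that stacks aligned reads on a reference and counts bases at each position. It assumes reads are already aligned and are given as hypothetical (start, sequence) pairs with no insertions, deletions, or clipping. A real pipeline would read this information from BAM files produced by the aligner and samtools.

```python
BASES = "ACGT"


def pileup_counts(reference: str, reads):
    """Return a 4 x len(reference) grid of base counts (rows ordered A, C, G, T).

    `reads` is an iterable of (start, sequence) pairs with 0-based starts,
    treated as ungapped alignments (a simplifying assumption).
    """
    counts = [[0] * len(reference) for _ in BASES]
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if 0 <= pos < len(reference) and base in BASES:
                counts[BASES.index(base)][pos] += 1
    return counts


def depth_and_mismatch(reference: str, counts):
    """Per-position read depth and fraction of bases that disagree with the reference."""
    depth = [sum(row[i] for row in counts) for i in range(len(reference))]
    mismatch = []
    for i, ref_base in enumerate(reference.upper()):
        matches = counts[BASES.index(ref_base)][i] if ref_base in BASES else 0
        mismatch.append((depth[i] - matches) / depth[i] if depth[i] else 0.0)
    return depth, mismatch


if __name__ == "__main__":
    reference = "ACGTAC"
    reads = [(0, "ACG"), (1, "CGT"), (2, "TTA")]
    grid = pileup_counts(reference, reads)
    print(depth_and_mismatch(reference, grid))
```

The count grid can be rendered as an image with one channel per base, and the depth and mismatch tracks are the kind of simpler summary you could compare against the full image.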

Experiment Steps

Define the exact prediction task, such as genus-level detection or M. smegmatis versus other bacteria.
Choose a public dataset with clear labels and a separate holdout set for final testing.
Design two feature pipelines, one based on k-mer counts and one based on read-pileup images.
Build a simple baseline first so you can measure whether the CNN adds real value.
Plan evaluation metrics that penalize false positives and stay informative under class imbalance, such as precision and recall, not just raw accuracy.
Lock the train-test split before tuning, then compare models with the same data partitions.
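One way to lock the split described in the last step is to generate it once with a fixed random seed, save the sample lists, and reuse them for every model. The sketch below does a stratified split in plain Python so each class keeps about the same share in train and test. The sample IDs, class names, and the 20% test fraction are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(labels: dict, test_fraction: float = 0.2, seed: int = 42):
    """Split sample IDs into train and test lists, class by class.

    `labels` maps sample ID -> class label. The fixed seed and the sorted
    iteration order make the result reproducible across runs.
    """
    by_class = defaultdict(list)
    for sample_id in sorted(labels):
        by_class[labels[sample_id]].append(sample_id)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        # Hold out at least one sample per class; a class with a single
        # sample therefore ends up only in the test set.
        n_test = max(1, round(len(ids) * test_fraction))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    labels = {f"myco_{i}": "Mycobacterium" for i in range(10)}
    labels.update({f"other_{i}": "other" for i in range(40)})
    train_ids, test_ids = stratified_split(labels)
    print(len(train_ids), len(test_ids))
```

Writing the two ID lists to a file before any tuning means the logistic regression and the CNN are always scored on exactly the same genomes.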

Common Pitfalls

Using poorly curated public labels, which teaches the model from mislabeled examples and makes every reported metric unreliable.
Training and testing on overlapping genomes, which inflates performance and hides poor generalization.
Letting class imbalance dominate the results, which can make a weak model look strong on paper.
Changing read-depth or alignment settings between samples, which creates artifacts in pileup images.
Tuning the CNN more heavily than the logistic regression baseline, which makes the comparison unfair.
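The overlapping-genomes pitfall can be guarded against by splitting on the genome, not on the individual sequencing run. The sketch below assigns every run from the same genome to the same partition by hashing a genome identifier, then checks that no genome appears on both sides. The identifiers and the hashing scheme are illustrative assumptions, and the test share is only approximately equal to the requested fraction.

```python
import hashlib


def assign_partition(genome_id: str, test_fraction: float = 0.2) -> str:
    """Map a genome ID to 'train' or 'test' using a stable hash."""
    digest = hashlib.sha256(genome_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_genome(samples, test_fraction: float = 0.2):
    """Split (sample_id, genome_id) pairs so each genome stays in one partition."""
    parts = {"train": [], "test": []}
    for sample_id, genome_id in samples:
        parts[assign_partition(genome_id, test_fraction)].append((sample_id, genome_id))
    return parts["train"], parts["test"]


def shared_genomes(train, test) -> set:
    """Return genome IDs that appear in both partitions (should be empty)."""
    return {g for _, g in train} & {g for _, g in test}


if __name__ == "__main__":
    # Hypothetical data: three sequencing runs per genome.
    samples = [(f"run_{i}", f"genome_{i // 3}") for i in range(30)]
    train, test = split_by_genome(samples)
    print(len(train), len(test), shared_genomes(train, test))
```

Running `shared_genomes` on any split you plan to report is a cheap check that inflated scores are not coming from the same genome being seen twice.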

What Makes This Competitive

A strong version of this project goes beyond basic accuracy. You compare models on a truly separate test set, track false positives, and explain where each model fails. You can also test transfer across species, strain groups, or sequencing depths. That kind of careful evaluation shows real scientific judgment, not just code running.

Project Variations

Use environmental non-mycobacterial bacteria as the negative class, then test whether the model still spots Mycobacterium cleanly.
Swap pileup images for coverage and mismatch summary plots, then compare whether the simpler visual encoding matches CNN performance.
Focus only on strain-level classification within M. smegmatis, then ask whether the model learns strain identity or just sequencing artifacts.

Learn More

NCBI SRA: Search for public whole-genome sequencing runs and download raw reads for labeled bacterial datasets.
NCBI RefSeq: Find curated bacterial reference genomes and annotations for label checking and alignment.
NIH PubMed: Search review articles on bacterial whole-genome sequencing, k-mers, and machine learning classification.
NCBI Datasets: Pull genome sequences and metadata for bacteria without scraping multiple sites.
MIT OpenCourseWare: Look for free courses in machine learning, computational biology, and biological data analysis.
USGS Microbiology resources: Use public microbial method pages and context for environmental bacteria when you need background.

Microbiology Category Guide

How to Do Real Microbiology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
