Microbial Source Tracking Bias in 16S Data
ISEF Category: Microbiology
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
A microbe can tell you where it came from, almost like a fingerprint. That sounds simple until the model starts working better for some places, or some groups, than others. You can test that with public 16S data and a careful bias audit. This project mixes microbiology, machine learning, and ethics in one question.
What Is It?
Microbial source tracking asks a simple question: where did this microbial sample come from? In fecal source tracking, you train a model on 16S rRNA data, a common way to identify the bacteria in a sample by sequencing a marker gene. The model learns patterns in the bacterial mix and tries to guess the source, such as human, cow, pig, or wastewater.
Think of it like a music app that recognizes a song from a few notes. The model does not hear the whole song. It looks for a pattern that matches what it learned before. SourceTracker2 is a widely used Bayesian tool for this kind of problem: it estimates what proportion of a sample's bacteria came from each candidate source. A neural baseline is a second model, a small neural network trained on the same data to make the same kind of prediction, so you can compare methods and see where each one works well or fails.
The bias part asks a different question. Does the model do equally well on samples from different regions, study groups, or sequencing setups? If the answer is no, then the model may not generalize well. That makes this a strong project for both microbiology and responsible AI.
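As a rough illustration of what a subgroup check can look like, the sketch below computes accuracy separately for each region in a tiny, made-up predictions table. The column names (`region`, `true_src`, `pred_src`) and all values are hypothetical; a real audit would use your own test-set predictions and would need far more samples per group.

```python
import pandas as pd

# Hypothetical predictions table: one row per test sample (values are made up).
results = pd.DataFrame({
    "region":   ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "true_src": ["human", "cow", "pig", "human", "cow", "cow", "pig", "human", "pig"],
    "pred_src": ["human", "cow", "pig", "human", "human", "cow", "cow", "cow", "cow"],
})

# Mark each prediction as right or wrong.
results["correct"] = results["true_src"] == results["pred_src"]

# Accuracy and sample count for each region.
by_region = results.groupby("region")["correct"].agg(accuracy="mean", n="size")

# Overall accuracy, and the gap between the best and worst region.
overall = results["correct"].mean()
gap = by_region["accuracy"].max() - by_region["accuracy"].min()

print(by_region)
print(f"overall accuracy: {overall:.2f}, best-to-worst gap: {gap:.2f}")
```

Reporting the sample count next to each accuracy matters: a group with only two samples, like region C here, cannot support a strong conclusion on its own.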
Why This Is a Good Topic
This is a good science fair topic because you can use public data, clear labels, and measurable performance metrics. You do not need to grow bacteria yourself, but you still work with a real microbiology problem that matters for water quality, food safety, and public health. You can test whether one method predicts source better than another, and whether performance changes across data subgroups. That gives you a clean path to original analysis.
Research Questions
- How does SourceTracker2 compare with a neural baseline for classifying fecal source on public 16S data?
- What is the effect of training set diversity on source-tracking accuracy across study sites?
- Does model performance change when you test on samples from different geographic regions?
- To what extent does sequencing platform or pipeline choice affect classifier agreement?
- Which sample classes produce the most false positives in fecal source tracking?
- How does class imbalance change sensitivity for rare source categories?
Basic Materials
- Computer with internet access and enough memory to handle tabular microbiome data.
- Public 16S rRNA datasets from repositories such as Qiita, MG-RAST, or NCBI Sequence Read Archive.
- Python installed with common data science packages.
- Spreadsheet software for tracking samples, labels, and metadata.
- Access to SourceTracker2 documentation and example workflows.
- Metadata table with sample origin, region, and study source.
- Notebook software for code, figures, and notes.
Advanced Materials
- University workstation or cloud compute with higher RAM for larger feature tables.
- Curated multi-study 16S datasets with harmonized taxonomic tables.
- Bioconductor or QIIME 2 outputs for preprocessing comparisons.
- Python deep learning libraries such as TensorFlow or PyTorch.
- Confusion-matrix and calibration analysis scripts.
- Statistical testing tools for subgroup comparison and fairness metrics.
- Version control system for code and analysis tracking.
Software & Tools
- Python: Cleans metadata, builds feature tables, and runs the baseline model.
- SourceTracker2: Provides a classic fecal source-tracking method for comparison.
- QIIME 2: Helps prepare and inspect 16S feature tables and taxonomy outputs.
- pandas: Organizes sample metadata and model outputs for analysis.
- scikit-learn: Supports train-test splits, baseline models, and performance metrics.
Experiment Steps
- Define the prediction task and decide which source classes and metadata fields you will include.
- Gather public 16S datasets that share enough metadata to support a fair comparison.
- Standardize the feature table so different studies can be compared on the same scale.
- Build a baseline workflow with SourceTracker2 and a second model that uses the same input data.
- Plan subgroup tests that compare accuracy across geography, study source, or sequencing platform.
- Choose metrics that capture both accuracy and bias, then organize your results table before running the full analysis.
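One way to approach the standardization step is to convert raw read counts into relative abundances, so that samples sequenced at different depths sit on the same scale. The sketch below assumes a samples-by-taxa count matrix and uses total-sum scaling, which is only one common option and not necessarily the best one for your data. The function name `to_relative_abundance` and the example counts are made up for illustration.

```python
import numpy as np

def to_relative_abundance(counts):
    """Scale each sample (row) so its taxa counts sum to 1."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        # A sample with no reads cannot be scaled; drop it before calling this.
        raise ValueError("at least one sample has zero total reads")
    return counts / totals

# Two made-up samples with very different sequencing depths (rows = samples).
counts = np.array([
    [10, 30, 60],      # 100 reads in total
    [500, 250, 250],   # 1000 reads in total
])

rel = to_relative_abundance(counts)
print(rel)
```

After scaling, every row sums to 1, so a taxon that makes up half of a shallow sample and half of a deep sample gets the same value in both.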
Common Pitfalls
- Mixing studies whose features are resolved to different taxonomic levels (for example, genus in one study and family in another), which makes the model learn dataset artifacts instead of source patterns.
- Ignoring metadata gaps, which leaves you with subgroup tests that have too few samples to mean anything.
- Comparing models on different feature sets, which makes one method look better for the wrong reason.
- Treating a high overall accuracy as proof of fairness, which can hide poor performance on smaller groups.
- Overfitting the neural baseline on a tiny dataset, which makes the test scores look strong but fail on new samples.
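The pitfall about overall accuracy hiding weak subgroup performance can be seen in a small made-up example. The sketch below uses scikit-learn metrics on an imbalanced two-class label set; the class sizes and predictions are invented purely to show the effect.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up imbalanced test set: 90 human samples and 10 pig samples.
y_true = ["human"] * 90 + ["pig"] * 10

# A model that gets every human sample right but only 2 of the 10 pig samples.
y_pred = ["human"] * 90 + ["human"] * 8 + ["pig"] * 2

acc = accuracy_score(y_true, y_pred)

# Recall (sensitivity) for each class, in the order given by `labels`.
per_class = recall_score(y_true, y_pred, labels=["human", "pig"], average=None)

# Balanced accuracy is the average of the per-class recalls.
bal = balanced_accuracy_score(y_true, y_pred)

print(f"overall accuracy:  {acc:.2f}")
print(f"recall for human:  {per_class[0]:.2f}")
print(f"recall for pig:    {per_class[1]:.2f}")
print(f"balanced accuracy: {bal:.2f}")
```

Here the overall accuracy looks strong while the rare class is missed most of the time, which is why per-class recall or balanced accuracy belongs in the results table alongside plain accuracy.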
What Makes This Competitive
A strong version of this project goes beyond a simple accuracy comparison. You would test several data splits, report subgroup performance, and explain why the model fails in certain settings. You could also compare calibration, not just accuracy, so you know whether the model's confidence matches reality. The best projects ask whether a method is useful across contexts, not just whether it works once.
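To make the calibration idea concrete, here is a minimal sketch of expected calibration error (ECE), one common way to summarize how far a model's stated confidence is from its actual hit rate. The function name, the choice of five equal-width bins, and the example numbers are all assumptions for illustration, not a prescribed method.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bins over [0, 1]; the interior edges decide bin membership.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return ece

# Made-up example: the model's top-class confidence and whether it was right.
confidence = [0.9, 0.9, 0.9, 0.9, 0.7, 0.7]
correct = [1, 1, 0, 0, 1, 0]

ece = expected_calibration_error(confidence, correct)
print(f"ECE: {ece:.3f}")
```

In this toy case the model claims 90% confidence on samples it gets right only half the time, so the ECE is large. A value near zero would mean its confidence roughly matches its accuracy.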
Project Variations
- Use wastewater and river samples instead of direct fecal samples to test how the model behaves in environmental monitoring.
- Replace the neural baseline with a random forest or support vector machine to compare classic machine learning against SourceTracker2.
- Add a domain-shift analysis that trains on one region or study and tests on another to measure generalization.
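The domain-shift variation can be sketched with scikit-learn's `LeaveOneGroupOut`, which holds out one whole study at a time so the model is always tested on a study it never saw during training. Everything below is synthetic: the data, the study names, the two source labels, and the random forest settings are placeholders, so the scores themselves mean nothing beyond showing the workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in for a relative-abundance table: 60 samples x 8 taxa.
n_samples, n_taxa = 60, 8
y = rng.integers(0, 2, size=n_samples)   # made-up labels: 0 = cow, 1 = human
studies = np.repeat(["study_A", "study_B", "study_C"], 20)

X = rng.random((n_samples, n_taxa))
X[:, 0] += y                              # give taxon 0 a source signal
X = X / X.sum(axis=1, keepdims=True)      # rows sum to 1, like relative abundance

held_out_scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=studies):
    # No study may appear on both sides of the split.
    assert set(studies[train_idx]).isdisjoint(studies[test_idx])

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])

    held_out = studies[test_idx][0]
    held_out_scores[held_out] = model.score(X[test_idx], y[test_idx])

for study, score in held_out_scores.items():
    print(f"trained without {study}: accuracy on {study} = {score:.2f}")
```

Comparing these held-out-study scores with scores from an ordinary random split is one way to estimate how much performance depends on having seen the same study during training.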
Learn More
- NIH Human Microbiome Project: Search the project site for public microbiome datasets and background on 16S data.
- NCBI Sequence Read Archive: Find raw sequencing studies and sample metadata for building your dataset.
- Qiita documentation: Look for tutorials on microbiome study processing and sample metadata handling.
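- SourceTracker paper (Knights et al., 2011, Nature Methods) and the SourceTracker2 repository on GitHub: Read the original method and validation details, then check the repository for the current Python implementation.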
- Nature Biotechnology and Microbiome: Search for review and methods papers on microbiome source tracking and model evaluation.
- NOAA Water Quality resources: Use government background material to connect source tracking to real pollution monitoring.
