Microbial Dark Matter Protein Discovery

ISEF Category: Microbiology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A large share of the proteins encoded in microbial genomes and metagenomes still have no confirmed name or function. That leaves huge parts of microbial biology looking like a map with blank spaces. You can use computational tools to group those unknown proteins, then ask what one mystery cluster might do. That is real discovery work, not just database searching.

What Is It?

In this guide, microbial dark matter means the huge set of microbial genes and proteins that scientists can see in sequence data but cannot yet explain. (The term is also used for microbes that have never been grown in culture.) Think of it like a library full of books with torn covers and missing labels. You can still sort them by language style, theme, and structure, even if you do not know the title.

In this project, you would use protein embeddings from ESM2, a protein language model that turns each protein sequence into a vector of numbers capturing sequence patterns and similarity. Then you would cluster unannotated proteins from IMG/M, a large metagenomics database, to find groups that seem related. After that, you would use ESMFold, a protein structure prediction tool, to inspect representatives from the most novel cluster. The goal is to propose a likely function class, like transport, binding, or enzyme-like activity, based on sequence and structure clues.
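The embedding step can be sketched in a few lines. This is a minimal sketch, not a definitive pipeline. It assumes the Hugging Face transformers port of ESM2 and the small facebook/esm2_t6_8M_UR50D checkpoint. The function names embed_sequences and cosine_similarity are made up for illustration, and mean pooling over residues is one common choice among several.

```python
import math


def embed_sequences(sequences, model_name="facebook/esm2_t6_8M_UR50D"):
    """Return one mean-pooled ESM2 vector (a list of floats) per sequence.

    Assumption: torch and transformers are installed and the checkpoint can
    be downloaded. The imports live inside the function so the rest of this
    sketch runs without them.
    """
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, dim)

    # Average over real tokens only; padding positions have mask 0.
    # The special start and end tokens are included in this average.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling embed_sequences(["MKTAYIAKQR", "MSDNELLKKA"]) on two made-up sequences would return two vectors that you can compare with cosine_similarity. Larger ESM2 checkpoints follow the same pattern but need more memory, which is where a GPU helps.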

Why This Is a Good Topic

This is a strong science fair topic because it starts with a huge open question and turns it into a clear analysis pipeline you can test. You can compare clustering methods, novelty filters, and structure-based annotations, which gives you real choices for original work. The project also connects to antibiotic discovery, enzyme mining, and microbiome research. A student can learn modern bioinformatics, data cleaning, clustering, and structure prediction without needing a wet lab.

Research Questions

How does the choice of clustering algorithm change the number of unannotated protein families found in IMG/M?
What is the effect of embedding model choice on how well unknown proteins separate into distinct clusters?
Does removing proteins with weak homolog hits increase the structural consistency of the top novel cluster?
To what extent do predicted ESMFold structures support one candidate function class over another for the same protein cluster?
Which similarity threshold best balances cluster size and annotation novelty for microbial dark matter proteins?
How does cluster purity change when you filter by protein length before embedding?
To what extent do the top novel clusters differ in predicted secondary structure content from annotated families?
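The similarity threshold question above can be explored with a small sweep. The sketch below is one possible approach under stated assumptions: it links two proteins whenever their cosine similarity meets the threshold (single-linkage clustering), and the toy 2-D vectors are made-up stand-ins for real embeddings. The names threshold_clusters and sweep are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def threshold_clusters(vectors, threshold):
    """Single-linkage clusters: link two proteins when cosine >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(vectors))]


def sweep(vectors, thresholds):
    """Return (threshold, number of clusters, largest cluster size) rows."""
    rows = []
    for t in thresholds:
        sizes = Counter(threshold_clusters(vectors, t))
        rows.append((t, len(sizes), max(sizes.values())))
    return rows


# Toy 2-D vectors standing in for real embeddings (made-up numbers).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
for row in sweep(toy, [0.5, 0.9, 0.999]):
    print(row)
```

A loose threshold merges everything into one cluster, while a strict one leaves every protein alone, so the interesting range sits in between. This version compares every pair, which is fine for a few thousand proteins but too slow for a full database download.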

Basic Materials

Computer with a reliable internet connection and at least 16 GB RAM.
Free account access to IMG/M or a downloadable protein set from a public microbial database.
Python installed with NumPy, pandas, scikit-learn, and matplotlib.
Access to a free GPU platform if available, such as Google Colab.
A text editor or notebook environment, such as Jupyter Notebook.
Spreadsheet software for tracking cluster labels and candidate proteins.

Advanced Materials

Workstation or server with a GPU for running ESM2 embeddings at scale.
Local storage for large protein sequence files and embedding outputs.
Python environment with PyTorch, Biopython, scikit-learn, and UMAP.
ESM2 model weights and ESMFold access through a research computing setup.
BLAST+ or MMseqs2 for homolog searches against public sequence databases.
Structural visualization software such as PyMOL or UCSF ChimeraX.

Software & Tools

Python: Handles sequence parsing, embedding workflows, clustering, and plotting.
Jupyter Notebook: Lets you document each analysis step and inspect results as you go.
Google Colab: Gives you free notebook-based compute for smaller embedding tests.
ESMFold: Predicts protein shapes so you can compare structure across a mystery cluster.
MMseqs2: Searches for homologs fast and helps you filter out proteins with known relatives.
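Filtering out proteins with known relatives can be sketched as below. This is a rough sketch that assumes your search results are in the 12-column, tab-separated BLAST-style format (the kind BLAST+ writes with -outfmt 6 and MMseqs2 can write through convertalis), with the query ID in column 1 and the E-value in column 11. The 1e-3 cutoff and the function names are hypothetical choices, not recommended settings.

```python
def queries_with_known_relatives(m8_lines, max_evalue=1e-3):
    """Return query IDs that have at least one hit at or below max_evalue."""
    known = set()
    for line in m8_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows rather than guess their meaning
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            known.add(query_id)
    return known


def novel_candidates(all_query_ids, m8_lines, max_evalue=1e-3):
    """Keep queries that have no hit passing the E-value cutoff."""
    known = queries_with_known_relatives(m8_lines, max_evalue)
    return [q for q in all_query_ids if q not in known]


# Made-up example rows: p1 has a strong hit, p2 a weak one, p3 no hits.
example_hits = [
    "p1\tdbA\t62.0\t150\t57\t0\t1\t150\t3\t152\t1e-30\t180",
    "p2\tdbB\t25.0\t40\t30\t0\t5\t44\t10\t49\t0.5\t28",
]
print(novel_candidates(["p1", "p2", "p3"], example_hits))
```

Note that a hit only proves a relative exists. If the matching protein is itself labeled hypothetical, you still have to decide whether that counts as known, and you should write that rule down before you look at the results.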

Experiment Steps

Define the protein set you will analyze, and decide how you will filter for truly unclassified families.
Choose one embedding strategy first, then plan a fair comparison against at least one alternative representation.
Set your clustering rule, and define what counts as a novel cluster with no annotated homologs.
Build a scoring plan that ranks clusters by novelty, size, internal consistency, and structural agreement.
Select representative proteins from the top cluster, then plan how you will compare predicted folds and domains.
Decide how you will turn the results into a candidate function class that is supported by both sequence and structure evidence.
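The scoring plan in the steps above could look something like this. It is a minimal sketch under loud assumptions: the four inputs are already scaled to the range 0 to 1, the weights and the size cap are arbitrary placeholders, and the names ClusterStats, score, and rank are invented for illustration. Your own plan may weigh the criteria very differently.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    name: str
    size: int                   # number of member proteins
    novelty: float              # fraction of members with no homolog hit, 0..1
    consistency: float          # mean within-cluster similarity, 0..1
    structure_agreement: float  # mean pairwise structural score, 0..1


def score(c, weights=(0.4, 0.2, 0.2, 0.2), size_cap=50):
    """Weighted sum of the four criteria; weights are placeholder values."""
    w_nov, w_size, w_cons, w_struct = weights
    # Cap the size term so one giant cluster cannot dominate the ranking.
    size_term = min(c.size, size_cap) / size_cap
    return (w_nov * c.novelty + w_size * size_term
            + w_cons * c.consistency + w_struct * c.structure_agreement)


def rank(clusters, **kwargs):
    """Sort clusters from highest to lowest score."""
    return sorted(clusters, key=lambda c: score(c, **kwargs), reverse=True)


# Made-up clusters to show the ranking.
toy_clusters = [
    ClusterStats("B", size=200, novelty=0.3, consistency=0.9, structure_agreement=0.9),
    ClusterStats("C", size=2, novelty=1.0, consistency=0.5, structure_agreement=0.2),
    ClusterStats("A", size=10, novelty=1.0, consistency=0.8, structure_agreement=0.7),
]
for c in rank(toy_clusters):
    print(c.name, round(score(c), 3))
```

Fixing the weights before you see any results keeps the ranking honest. You can then report how much the top cluster changes when the weights change.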

Common Pitfalls

Including proteins that are already annotated in another database, which makes a cluster look novel when it already has known homologs.
Mixing protein fragments with full-length proteins, which can distort embeddings and cluster boundaries.
Treating every isolated outlier as a discovery, when an outlier is often just a fragment, a low-complexity sequence, or an assembly error.
Skipping a homolog search cutoff plan, which makes novelty claims impossible to defend.
Comparing predicted structures without controlling for protein length, which can make unrelated folds look similar. A length-normalized measure such as TM-score is easier to defend than a raw alignment score.
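One way to guard against the fragment pitfall above is a length filter applied before embedding. The sketch below assumes FASTA input and uses hypothetical bounds of 50 and 1000 residues, so pick your own limits and justify them. The helper names read_fasta and filter_by_length are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=50, max_len=1000):
    """Split records into those inside [min_len, max_len] and the rest."""
    kept, dropped = [], []
    for header, seq in records:
        seq = seq.rstrip("*")  # some gene predictors append '*' for the stop
        target = kept if min_len <= len(seq) <= max_len else dropped
        target.append((header, seq))
    return kept, dropped


# Made-up sequences; the bounds are shrunk so the toy data shows both outcomes.
toy_fasta = [">a", "MKTAYIAK", ">b", "MK", ">c", "MSDNE", "LLKKA*"]
kept, dropped = filter_by_length(read_fasta(toy_fasta), min_len=5, max_len=12)
print([h for h, _ in kept], [h for h, _ in dropped])
```

Keeping the dropped list lets you report how many proteins the filter removed, which judges will likely ask about.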

What Makes This Competitive

A class project might stop at clustering unknown proteins. A stronger project asks whether the cluster really is novel, whether more than one method agrees, and whether structure prediction supports the same story. You can raise the level by comparing multiple embedding or clustering settings, using stricter homolog filters, and testing cluster stability. If you connect your result to a plausible function class with careful evidence, the project starts to look like real discovery work.
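Cluster stability can be tested by clustering twice, for example with two settings or two random subsamples, and checking how often the two runs agree. The sketch below uses the Rand index, which is one standard agreement measure. The function name pair_agreement is hypothetical, and the sketch assumes both label lists cover the same proteins in the same order.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of protein pairs on which two clusterings agree.

    A pair counts as agreeing when both runs put the two proteins together,
    or both runs put them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same proteins in order")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total if total else 1.0


# Made-up labels: the first pair of runs agrees fully, the second does not.
print(pair_agreement([0, 0, 1, 1], [5, 5, 9, 9]))
print(pair_agreement([0, 0, 1, 1], [0, 1, 0, 1]))
```

Cluster label names do not need to match between runs, only the groupings do. The plain Rand index can look high by chance when there are many small clusters, so a chance-corrected version such as adjusted_rand_score in scikit-learn is worth reporting alongside it.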

Project Variations

Focus on proteins from one environmental source, such as soil, ocean, or gut metagenomes, to see whether novel clusters differ by habitat.
Compare ESM2-based clustering with a simpler sequence similarity method to test whether embeddings recover more hidden families.
Add domain prediction and secondary structure analysis to see whether the top unknown cluster shares a common fold pattern.

Learn More

NCBI BLAST: Search for homologous proteins and check whether your top cluster already has known relatives, using the BLAST tools page at NCBI.
IMG/M: Explore metagenome-derived gene and protein sets, using the Integrated Microbial Genomes and Microbiomes database from JGI.
ESM Resources: Read model documentation and examples for protein language embeddings and structure prediction from Meta's public ESM project pages.
PubMed: Search for review articles on protein language models, metagenomic dark matter, and structure-based function prediction.
NIH Bookshelf: Find free background reading on molecular evolution, protein structure, and bioinformatics methods.
MIT OpenCourseWare: Use free molecular biology, genetics, and computational biology course materials for background understanding.

Microbiology Category Guide

How to Do Real Microbiology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

