Protein Embeddings for Olfactory Receptor Adaptation

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genetics · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months

The Hook

Some animals can smell the same world very differently because a few receptor proteins changed over time. You can turn those proteins into numeric fingerprints with ESM-2, a protein language model trained on millions of protein sequences. Then you can test whether receptors from different species cluster by habitat, diet, or lineage.

What Is It?

ESM-2 is a protein language model. It reads amino-acid sequences the way a text model reads words and produces a vector of numbers for every residue. Averaging those vectors gives one embedding per protein: a fixed-length list of numbers that captures pattern and context. For olfactory receptors, those numbers can act like a fingerprint for how a receptor family has changed across species.
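A per-protein embedding is commonly built by averaging the per-residue vectors the model returns. The sketch below shows one way to do that with Hugging Face Transformers. The checkpoint name `facebook/esm2_t6_8M_UR50D` (the smallest public ESM-2 model) and the choice of mean pooling are assumptions made for illustration, not requirements of the project, and the function names are hypothetical.

```python
def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / n for i in range(dim)]


def embed_with_esm2(sequence, checkpoint="facebook/esm2_t6_8M_UR50D"):
    """Return a mean-pooled ESM-2 embedding for one amino-acid string.

    Assumes torch and transformers are installed and the checkpoint can be
    downloaded. The imports live inside the function so that mean_pool
    above works without them.
    """
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EsmModel.from_pretrained(checkpoint)
    model.eval()
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    # Drop the start and end special tokens so only residues are averaged.
    residues = hidden[1:-1].tolist()
    return mean_pool(residues)
```

For a real dataset you would load the tokenizer and model once and reuse them for every sequence; reloading inside the function keeps this sketch short but is slow.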

In this project, you are not trying to prove that one smell receptor is better than another. You are asking whether receptors from species with different diets, habitats, or social behaviors differ in a consistent way. If the embedding space separates those groups, you may have a clue that adaptation left a measurable signal in the sequence itself.

Why This Is a Good Topic

This is a strong science fair topic because you can test it with public sequence data, clear labels, and measurable outputs. It connects to evolution, sensory biology, and how animals find food, mates, and danger. You can learn how to clean data, compare models, and judge whether a pattern is real or just noise.

Research Questions

How does embedding distance between olfactory receptor proteins change across species with different diets?
What is the effect of habitat type on clustering of ESM-2 embeddings within one olfactory receptor family?
Does a classifier trained on embeddings predict species group better than one trained on raw sequence identity?
To what extent do embedding clusters match known phylogenetic groups after controlling for receptor subfamily?
Which receptor subfamilies show the strongest separation by species trait in embedding space?
How does removing duplicated receptor copies affect the adaptation signal you measure?
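Several of the questions above come down to comparing distances inside a trait group with distances between groups. Here is a minimal sketch of that comparison using cosine distance. The function names and the toy vectors in the usage note are hypothetical, and cosine distance is only one reasonable choice of metric.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def within_between(embeddings, labels):
    """Return (mean within-group distance, mean between-group distance)."""
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = cosine_distance(embeddings[i], embeddings[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)
```

Calling `within_between(vectors, diet_labels)` on your real embeddings gives two numbers to compare. A between-group mean that is clearly larger than the within-group mean is a hint of separation, not proof of adaptation.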

Basic Materials

Laptop with at least 16 GB RAM or a cloud notebook account.
Internet access for public sequence databases.
Spreadsheet software for tracking species, genes, and traits.
Python 3.11 with a notebook environment.
FASTA files or exports from NCBI Protein or UniProt.
A text editor for notes and code.

Advanced Materials

High-memory workstation or GPU server.
Local copy of ESM-2 weights or another protein embedding model.
Curated ortholog set for one olfactory receptor family.
Multiple sequence alignment tool such as MAFFT.
Phylogenetic analysis software such as IQ-TREE or RAxML.
Version-controlled metadata table with species traits and gene IDs.

Software & Tools

Python: Runs sequence parsing, embedding extraction, and statistical tests.
Jupyter Notebook: Keeps your analysis, notes, and plots in one place.
Biopython: Reads FASTA files, filters sequences, and manages protein metadata.
Hugging Face Transformers: Loads ESM-2 or similar models for embedding generation.
scikit-learn: Trains classifiers and measures how well embeddings separate groups.
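To make the FASTA-reading and filtering step concrete, here is a small standard-library sketch. Biopython's `SeqIO.parse(handle, "fasta")` does the parsing part for you; the hand-written reader below is only a stand-in so the example runs without extra installs. The filter rules (minimum length, starting methionine, standard residues only) and the default threshold of 250 are illustrative assumptions that you should tune for your receptor family.

```python
import io

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_record(sequence, min_length=250):
    """Hypothetical filter for records that look complete and unambiguous."""
    return (
        len(sequence) >= min_length
        and sequence.startswith("M")
        and set(sequence) <= STANDARD_AA
    )


# Toy input with made-up headers; real files come from NCBI Protein or UniProt.
example = io.StringIO(">sp1_OR1\nMKT\nLLV\n>sp2_OR1\nKTX\n")
for header, seq in read_fasta(example):
    print(header, len(seq), keep_record(seq, min_length=5))
```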

Experiment Steps

Define one receptor family, one species set, and one adaptation trait so your comparison stays focused.
Build a clean metadata table that matches each sequence to species, receptor ID, and trait label.
Choose how you will turn each protein sequence into an embedding and how you will summarize the output.
Set a simple baseline, such as sequence identity or k-mer counts, so you can compare against the embedding method.
Plan the test you will use, such as clustering, classification, or correlation, and decide which controls will challenge the signal.
Lock in your split between training and evaluation data before you run the model, so you can judge generalization honestly.
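The baseline mentioned in the steps above can be very simple. This sketch builds k-mer count profiles and compares two sequences by cosine similarity, with no alignment needed. The function names and the default `k=3` are assumptions for illustration; pairwise sequence identity from an alignment is an equally valid baseline.

```python
import math
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count every overlapping k-mer in an amino-acid string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_cosine(seq_a, seq_b, k=3):
    """Cosine similarity between the k-mer count profiles of two sequences."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A sequence shorter than k has no k-mers to compare.
        return 0.0
    return dot / (norm_a * norm_b)
```

Running the same clustering or classification test on k-mer similarities and on ESM-2 embeddings lets you say whether the embeddings add anything beyond plain sequence composition.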

Common Pitfalls

Mixing orthologs and paralogs, which makes the analysis compare genes with different duplication histories instead of the same gene across species.
Using incomplete protein records, which can make missing sequence regions look like adaptation.
Letting species labels carry hidden confounds, such as diet and lineage changing together, which makes it impossible to tell which one explains the pattern.
Skipping a baseline method, which leaves you unable to tell whether embeddings beat simple sequence similarity.
Showing one neat cluster plot without uncertainty checks, which can hide weak or unstable results.
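One way to add an uncertainty check to a cluster plot is a label permutation test: shuffle the trait labels many times and ask how often random labels separate the embeddings as well as the real ones. The sketch below is a minimal version under stated assumptions. The separation statistic (mean between-group minus mean within-group Euclidean distance), the function names, and the default of 1000 shuffles are all illustrative choices.

```python
import math
import random
from itertools import combinations


def separation(embeddings, labels):
    """Mean between-group distance minus mean within-group distance."""
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_p_value(embeddings, labels, n_permutations=1000, seed=0):
    """Share of label shuffles that separate at least as well as the real labels."""
    rng = random.Random(seed)
    observed = separation(embeddings, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if separation(embeddings, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

Free shuffling treats every sequence as independent, so it does not correct for shared ancestry. Treat a small p-value here as a first check and pair it with a phylogeny-aware control.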

What Makes This Competitive

A strong version of this project compares ESM-2 results against simple similarity scores and tests more than one olfactory receptor subfamily. It also checks whether the signal survives when you hold out entire clades from training, so the model cannot succeed just by recognizing close relatives. If you add careful controls for gene length, duplication, and phylogeny, you get a much stronger research story.
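The clade hold-out idea can be sketched as a simple split generator: every sequence from one clade goes into the evaluation set, and the classifier trains only on the other clades. The `"clade"` key and the function name are hypothetical. scikit-learn's `LeaveOneGroupOut` offers the same kind of split if you prefer a library routine.

```python
def leave_one_clade_out(records):
    """Yield (held_out_clade, train_indices, test_indices) for each clade.

    `records` is a list of dicts that each carry a hypothetical "clade" key.
    """
    clades = sorted({r["clade"] for r in records})
    for clade in clades:
        test = [i for i, r in enumerate(records) if r["clade"] == clade]
        train = [i for i, r in enumerate(records) if r["clade"] != clade]
        yield clade, train, test
```

If accuracy stays above your baseline on clades the model never saw, the trait signal is less likely to be a side effect of relatedness alone.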

Project Variations

Compare mammal species with different diets to see whether feeding strategy tracks olfactory receptor embeddings.
Repeat the same analysis on another sensory gene family, such as taste receptors, to see whether the signal is specific to smell.
Test one clade at a time, such as bats or rodents, to see whether specialization inside a lineage changes the pattern.

Learn More

NCBI Protein: Search protein sequences, annotations, and species records in the NCBI database.
UniProt: Find curated protein entries and family notes by searching UniProt.
PubMed: Search review articles on olfactory receptors, protein language models, and molecular evolution.
NCBI BLAST: Compare receptor sequences and identify close homologs using the BLAST tool.
MIT OpenCourseWare: Look for free courses in genetics, bioinformatics, and machine learning.
