Language Models for Cytochrome C Evolution

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Evolutionary Biology · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A protein sequence is like a sentence, and one wrong letter can change the meaning. Your model reads thousands of those sentences and guesses which new ones still work. If it succeeds, you get a window into how evolution explores protein space. You also get a strong project for computational biology.

What Is It?

Cytochrome c is a small heme-containing protein found in many organisms. In mitochondria, it shuttles electrons between complex III and complex IV of the electron transport chain during energy production. Its sequence has changed across evolution, but many changes still keep the protein functional. That makes it a good test case for a model that learns the patterns behind working protein sequences.

Your project asks a computer model to learn those patterns from lots of cytochrome c sequences, then score new variants that do not appear in your training set. Think of it like training a grammar model on thousands of sentences, then asking which new sentences still sound right. In this case, the "grammar" is the set of amino acid rules that help a protein fold and function. You then compare the model's scores with experimentally tested variants from public databases.
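To make the "grammar" idea concrete, here is a minimal sketch that uses a smoothed bigram model as a stand-in for a real protein language model. It is far simpler than the transformer-style models you would train in PyTorch, and the sequences are short, made-up fragments rather than real cytochrome c data. The function names and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_bigram_model(sequences, pseudocount=1.0):
    """Estimate P(next residue | previous residue) with additive smoothing."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        padded = "^" + seq  # "^" marks the start of a sequence
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    vocab_size = len(AMINO_ACIDS)

    def log_prob(prev, nxt):
        numerator = pair_counts[(prev, nxt)] + pseudocount
        denominator = prev_counts[prev] + pseudocount * vocab_size
        return math.log(numerator / denominator)

    return log_prob

def score_sequence(log_prob, seq):
    """Average log-probability per residue; higher means more 'grammatical'."""
    padded = "^" + seq
    total = sum(log_prob(p, n) for p, n in zip(padded, padded[1:]))
    return total / len(seq)

# Hypothetical toy fragments, not real cytochrome c sequences.
training_set = ["GDVEKGKK", "GDVEKGKK", "GDAEKGKK", "GDVAKGKK"]
model = train_bigram_model(training_set)

familiar_score = score_sequence(model, "GDVEKGKK")
disrupted_score = score_sequence(model, "GDWWKGKK")  # unseen residue pairs
print(familiar_score, disrupted_score)
```

The fragment that follows patterns seen in training gets the higher score. A real project would swap the bigram counts for a trained neural model, but the scoring step (sum of log-probabilities over the sequence) keeps the same shape.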

Why This Is a Good Topic

This is a strong science fair topic because you can turn sequence data into measurable predictions and test them against known outcomes. The project connects to protein design, evolution, and disease research, since the same logic helps scientists study how mutations affect function. You can build it with public data and open-source tools, which makes it realistic without a wet lab. A student can learn data cleaning, model evaluation, and statistical comparison while still doing original research.

Research Questions

How does a small language model's ranking of cytochrome c variants compare with known experimental function scores?
What is the effect of training set size on the model's ability to identify functional variants?
Does removing closely related sequences from training change how well the model generalizes to unseen species?
To what extent do model scores match experimental stability or activity measurements for cytochrome c mutants?
Which mutation positions does the model treat as most constrained, and do those positions overlap with known functional sites?
How does a language model compare with a simpler conservation-based baseline for predicting cytochrome c function?
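The last two questions both depend on a per-position measure of constraint. One common choice, sketched here under simplifying assumptions, is the Shannon entropy of each alignment column: low entropy means the position rarely changes. The toy alignment is hypothetical (loosely shaped like a CXXCH heme-binding motif), and the sketch assumes the sequences are already aligned to equal length with no gaps.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conservation_profile(aligned_sequences):
    """Return one entropy value per column of an equal-length alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("All aligned sequences must have the same length")
    return [
        column_entropy([seq[i] for seq in aligned_sequences])
        for i in range(length)
    ]

# Hypothetical toy alignment, not real data.
toy_alignment = ["CAACH", "CSQCH", "CLACH", "CAGCH"]
profile = conservation_profile(toy_alignment)

# Positions with (near) zero entropy never vary in this alignment.
most_constrained = [i for i, h in enumerate(profile) if h < 1e-9]
print(profile)
print(most_constrained)
```

You could then check whether the positions your language model penalizes most overlap with the low-entropy columns, and use the entropy profile itself as the simpler baseline.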

Basic Materials

Computer with enough memory to handle large sequence files.
Stable internet access for downloading public sequence and variant datasets.
Python installed with Jupyter Notebook.
NumPy and pandas for data cleaning and analysis.
Biopython for sequence handling.
scikit-learn for baseline models and evaluation.
A spreadsheet program for quick review of metadata.
Git for version control and reproducibility.

Advanced Materials

Access to a university or cloud GPU for training a small sequence model.
Large public protein sequence database downloads.
Curated experimental cytochrome c variant datasets from peer-reviewed sources.
PyTorch or TensorFlow for model training.
FASTA processing and alignment tools.
Sequence embedding or transformer model libraries.
Statistical analysis software such as R or Python packages like SciPy and statsmodels.
A workflow manager such as Snakemake or Nextflow for reproducible pipelines.

Software & Tools

Python: Cleans sequence data, trains baseline models, and runs evaluation code.
Jupyter Notebook: Lets you document analysis, plots, and results in one place.
Biopython: Parses FASTA files and helps you manipulate protein sequences.
PyTorch: Trains a small language model on cytochrome c sequences.
scikit-learn: Builds simple comparison models and calculates performance metrics.
ImageJ: Not needed for this project. Skip it unless you have to measure features in exported figures by hand.
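As a rough illustration of the data-cleaning role these tools play, the sketch below reads FASTA text, drops sequences with non-standard residues, and removes exact duplicates. It uses only the Python standard library so it runs anywhere. In a real project, Biopython's SeqIO.parse would normally handle the parsing. The records are invented, and the rule of discarding any sequence with an ambiguous residue is an assumption you may want to relax.

```python
import io

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def clean_records(records):
    """Drop empty, ambiguous, and exactly duplicated sequences."""
    seen = set()
    kept = []
    for header, seq in records:
        if not seq or set(seq) - VALID_RESIDUES:
            continue  # empty or contains a non-standard letter such as X
        if seq in seen:
            continue  # exact duplicate of an earlier record
        seen.add(seq)
        kept.append((header, seq))
    return kept

# Hypothetical records for illustration only.
toy_fasta = io.StringIO(
    ">seq1 hypothetical\nGDVEKGKK\nIFVQ\n"
    ">seq2 duplicate of seq1\nGDVEKGKKIFVQ\n"
    ">seq3 has ambiguous residue\nGDXEKGKK\n"
    ">seq4 hypothetical\nGDAEKGKKIFVQ\n"
)
cleaned = clean_records(read_fasta(toy_fasta))
print(cleaned)
```

Only seq1 and seq4 survive: seq2 repeats seq1 once its wrapped lines are joined, and seq3 contains an X.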

Experiment Steps

Define your prediction target, such as variant function, activity, or stability, and decide how you will score success.
Assemble a clean training set of cytochrome c sequences and remove duplicates, bad metadata, and obvious outliers.
Choose a baseline method first, so you can compare the language model against a simpler rule set.
Design your validation plan around experimentally characterized variants from public datasets that the model never sees during training.
Decide which evaluation metrics will matter most, such as ranking accuracy, correlation, or top-k recovery.
Plan a separate analysis for mutation positions, so you can see whether the model learns biologically meaningful constraints.
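Two of the metrics named in the steps above, rank correlation and top-k recovery, can be sketched in plain Python as follows. The scores and measurements are made-up numbers, and the helper names are illustrative. In practice you would likely call scipy.stats.spearmanr instead of writing your own.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def top_k_recovery(model_scores, measured, k):
    """Fraction of the k best measured variants that the model also ranks in its top k."""
    def top(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])
    return len(top(model_scores) & top(measured)) / k

# Hypothetical numbers: model log-likelihoods and measured activity.
model_scores = [-1.2, -0.4, -2.5, -0.9, -3.1]
measured_activity = [0.70, 0.95, 0.20, 0.40, 0.10]

rho = spearman(model_scores, measured_activity)
recovery = top_k_recovery(model_scores, measured_activity, k=2)
print(rho, recovery)
```

Reporting both numbers is useful because a model can have a decent overall correlation while still missing some of the best variants, which is what top-k recovery exposes.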

Common Pitfalls

Mixing protein sequences from different cytochrome c families, which can teach the model the wrong evolutionary patterns.
Using experimental records with inconsistent labels, which makes model performance look better or worse than it really is.
Letting near-identical sequences appear in both training and test sets, which causes data leakage and inflated scores.
Judging the model only by overall accuracy, which can hide poor ranking of rare but important functional variants.
Skipping a simple baseline, which makes it hard to tell whether the language model adds real value.
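The data leakage pitfall is easy to check for in code. The sketch below drops any test sequence that is too similar to a training sequence. It assumes the sequences are pre-aligned to equal length, and the 0.8 identity threshold is an arbitrary placeholder, not a recommended value. For large datasets, dedicated clustering tools such as CD-HIT or MMseqs2 are the usual choice.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def filter_leaky_test_sequences(train, test, threshold=0.8):
    """Split test sequences into kept and dropped based on similarity to train."""
    kept, dropped = [], []
    for candidate in test:
        too_close = any(
            percent_identity(candidate, seen) >= threshold for seen in train
        )
        (dropped if too_close else kept).append(candidate)
    return kept, dropped

# Hypothetical toy sequences for illustration.
train_set = ["GDVEKGKKIF", "GDAEKGRKIF"]
test_set = ["GDVEKGKKIY", "ASPEAGAKLF"]

kept, dropped = filter_leaky_test_sequences(train_set, test_set)
print(kept, dropped)
```

Here the first test sequence differs from a training sequence at only one position, so it is dropped. The second shares 4 of 10 positions with each training sequence and is kept.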

What Makes This Competitive

A competitive version of this project goes past "the model worked" and asks where, when, and why it works. You can strengthen it by using strict train-test splits, a conservation baseline, and a separate family-level holdout test. You can also compare several model sizes or training sets to see how prediction quality changes. The strongest projects make a clear biological claim, not just a computer science claim.
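A family-level holdout can be sketched as a split that keeps every sequence from a given group on the same side. The group labels, sequences, and 25% holdout fraction below are all hypothetical. scikit-learn's GroupShuffleSplit offers a more complete version of the same idea.

```python
import random
from collections import defaultdict

def group_holdout_split(records, holdout_fraction=0.25, seed=0):
    """Hold out whole groups so no group appears in both train and test.

    records: list of (group_label, sequence) tuples.
    """
    groups = defaultdict(list)
    for group, seq in records:
        groups[group].append(seq)
    names = sorted(groups)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(names)
    n_holdout = max(1, round(len(names) * holdout_fraction))
    holdout = set(names[:n_holdout])
    train = [(g, s) for g, s in records if g not in holdout]
    test = [(g, s) for g, s in records if g in holdout]
    return train, test

# Hypothetical group labels and sequences.
toy_records = [
    ("mammals", "GDVEKGKKIF"), ("mammals", "GDVEKGKKIY"),
    ("fungi", "GSAKKGATLF"), ("fungi", "GSAKKGANLF"),
    ("plants", "GNPKAGEKIF"), ("bacteria", "ADAAAGKTLY"),
]
train_records, test_records = group_holdout_split(toy_records)

train_groups = {g for g, _ in train_records}
test_groups = {g for g, _ in test_records}
print(train_groups, test_groups)
```

Because the held-out group never appears in training, performance on it says more about generalization to unseen lineages than a random split would.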

Project Variations

Use cytochrome c from one taxonomic group, such as bacteria, fungi, or animals, to test whether evolutionary constraints change across lineages.
Swap the language model for a conservation score computed from a multiple sequence alignment, or for a profile hidden Markov model baseline, and compare prediction quality.
Focus on mutation effect prediction at conserved functional positions, such as the heme-binding residues, instead of whole-protein variant scoring.

Learn More

NCBI Protein Database: Search for cytochrome c sequences, annotations, and linked records for public data gathering.
UniProt: Find curated protein function records and sequence metadata for cross-checking labels.
PubMed: Search review articles on protein language models, cytochrome c, and variant effect prediction.
MIT OpenCourseWare: Use free machine learning and computational biology course materials to build background on modeling.
NCBI Bookshelf: Read free textbook chapters on molecular evolution, proteins, and sequence analysis.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

