Finding Micropeptides in lncRNAs

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genomics · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

Your genome may hide tiny protein-coding genes in places scientists once called noncoding. Some of those hidden genes can make micropeptides, which are short proteins with real biological roles. You can search for them with public data and prediction tools. That means you can do real gene discovery from your laptop.

What Is It?

Long noncoding RNAs, or lncRNAs, are RNA molecules longer than about 200 nucleotides that were originally thought not to encode proteins. Small open reading frames, or smORFs, are short stretches of RNA, usually defined as 100 codons or fewer, that can be translated into tiny proteins called micropeptides. Think of an lncRNA as a long instruction manual with small secret notes buried inside it. Ribosome profiling sequences the short RNA fragments protected by ribosomes, the cell’s protein-making machines, so it shows where ribosomes sit on an RNA and gives evidence that translation is happening.

This project asks whether some lncRNAs contain real smORFs that are missed by standard gene annotations. You can reanalyze public ribosome-profiling data from GWIPS-viz, then use AlphaFold or a related structure-prediction tool to check whether each candidate micropeptide is predicted to fold with reasonable confidence. Finding short sequences is not enough. The goal is to rank candidates that have both evidence of translation and a plausible protein shape.
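To make the idea of a smORF concrete, here is a minimal sketch of a scanner that looks for short open reading frames in a transcript sequence. It rests on several assumptions that are not part of this guide: it only accepts ATG start codons (some translated smORFs start at near-cognate codons such as CTG), it only reads the forward strand, and the 10-to-100-codon window and the function name `find_smorfs` are illustrative choices you should tune yourself.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(transcript, min_codons=10, max_codons=100):
    """Return (start, end, frame) tuples for ATG-initiated ORFs.

    start and end are 0-based and end-exclusive, and the span includes
    the stop codon. Length limits count sense codons only (the start
    codon is included, the stop codon is not). The thresholds are
    illustrative assumptions, not fixed rules.
    """
    seq = transcript.upper().replace("U", "T")
    hits = []
    for frame in range(3):
        start = None  # first in-frame ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, i + 3, frame))
                start = None  # reset and look for the next ORF
    return sorted(hits)
```

Each hit is only a sequence-level candidate. An ORF that lacks an in-frame stop codon before the transcript ends is skipped, and nothing here says the ORF is actually translated; that evidence has to come from the ribosome-profiling data.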

Why This Is a Good Topic

This is a strong science fair topic because it starts with public data, yet still leaves room for original discovery. You can test clear rules for calling a candidate smORF, compare different filters, and ask whether structure prediction helps prioritize the best hits. The topic connects to hidden gene discovery, genome annotation, and the search for new biomarkers or cell functions. You can learn real bioinformatics skills, from data mining to sequence analysis to evidence-based ranking.

Research Questions

How does the ribosome footprint signal differ between candidate lncRNA smORFs and matched noncoding regions?
What is the effect of smORF length on the likelihood that ribosome profiling supports translation?
Does conservation across species increase the chance that an lncRNA smORF is a credible micropeptide candidate?
To what extent do AlphaFold confidence scores (pLDDT) help rank translated smORFs by how plausible their predicted fold is?
Which lncRNAs contain smORFs that pass both ribosome-profiling evidence and protein-folding plausibility checks?
How does the choice of annotation filter change the number of candidate micropeptides recovered from public datasets?
What is the effect of start-codon context on the strength of smORF candidate evidence?
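For the start-codon question, one common way to put a number on "context" is the Kozak rule of thumb: a purine (A or G) three bases before the ATG and a G right after it both favor initiation. The sketch below labels a start site as weak, adequate, or strong using only those two positions. The function name `kozak_strength` and the three-level labels are illustrative assumptions, and a fuller analysis could score more positions.

```python
def kozak_strength(transcript, atg_index):
    """Classify start-codon context from the -3 and +4 positions.

    atg_index is the 0-based position of the A in the ATG.
    Returns "strong" if both positions match the Kozak preference,
    "adequate" if one matches, and "weak" if neither does.
    Positions that fall outside the sequence count as non-matching.
    """
    seq = transcript.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("atg_index does not point at an ATG")
    # -3 position: three bases upstream of the A in ATG
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    # +4 position: the base immediately after the ATG
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    matches = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "adequate", "strong")[matches]
```

You could then compare ribosome-footprint support across the three context classes to see whether stronger context goes with stronger translation evidence.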

Basic Materials

Computer with stable internet access.
Web browser with access to GWIPS-viz.
Spreadsheet software for tracking candidates.
Sequence browser or annotation viewer for checking genomic context.
Python installed through Anaconda or a similar free distribution.
Jupyter Notebook for organizing analysis.
FASTA files for candidate RNA and peptide sequences.
Basic file storage for downloaded public datasets.
PubMed access for reading review papers and primary studies.

Advanced Materials

Computer with enough memory to run batch sequence analysis.
Local Python environment with Biopython and pandas.
AlphaFold or ColabFold access for structure prediction runs.
Multiple sequence alignment software for conservation checks.
Ribosome profiling datasets from GEO or GWIPS-viz.
Genome annotation files in GTF or GFF format.
BLAST tools for similarity checks against known proteins.
R or Python packages for statistical testing and visualization.

Software & Tools

GWIPS-viz: Lets you inspect public ribosome-profiling tracks to spot translation signals across transcripts.
AlphaFold: Predicts protein structure for candidate micropeptides so you can judge whether a hit looks physically plausible.
ColabFold: Provides a more accessible way to run structure prediction on short peptides and compare candidates.
Python: Helps you automate candidate filtering, scoring, and ranking across many transcripts.
IGV: Lets you visually check read coverage and transcript context for shortlisted smORFs.
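AlphaFold and ColabFold write their per-residue confidence score, pLDDT, into the B-factor column of the PDB files they produce. The sketch below averages that value over alpha-carbon atoms so each candidate gets one number. It assumes standard fixed-width PDB formatting, and the function name `mean_plddt` is a hypothetical helper, not part of either tool.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column over CA atoms in PDB-format text.

    For AlphaFold or ColabFold output this column holds pLDDT (0 to 100).
    For experimental structures it holds a true B-factor instead, so
    only use this on predicted models.
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name sits in columns 13-16, B-factor in columns 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no CA atoms found in the PDB text")
    return sum(scores) / len(scores)
```

Keep in mind that pLDDT reports how confident the model is in its prediction. It is not a measurement of stability, and very short peptides often score low even when they are real.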

Experiment Steps

Define your discovery rule set for what counts as a candidate smORF, including evidence from ribosome profiling, coding potential, and peptide length.
Collect a curated list of lncRNAs and matched control regions so you can compare translated and nontranslated sequences fairly.
Build a ranking pipeline that scores candidates by ribosome footprint support, start-codon context, and conservation across species.
Predict peptide structure for your top candidates and decide how you will use confidence scores to separate likely real micropeptides from weak hits.
Compare your candidates against known annotations so you can tell whether your pipeline finds both established micropeptides and novel leads.
Plan a validation path that suggests which candidates deserve follow-up by a wet-lab collaborator or a future study.
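The ranking step above can be sketched as a simple weighted score. Everything specific in this example is an assumption made for illustration: the `Candidate` fields, the 0-to-1 scaling of each evidence type, and especially the weights, which are arbitrary starting values you would need to justify or tune against known micropeptides.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    frame_fraction: float  # share of footprints in the ORF's reading frame, 0 to 1
    kozak: str             # start-codon context: "weak", "adequate", or "strong"
    conservation: float    # any conservation score you rescale to 0 to 1
    mean_plddt: float      # structure-prediction confidence, 0 to 100


# Illustrative values only; these are not taken from any published method.
KOZAK_POINTS = {"weak": 0.0, "adequate": 0.5, "strong": 1.0}
WEIGHTS = {"ribo": 0.4, "kozak": 0.2, "cons": 0.2, "plddt": 0.2}


def score(c):
    """Combine the four evidence types into one number between 0 and 1."""
    return (WEIGHTS["ribo"] * c.frame_fraction
            + WEIGHTS["kozak"] * KOZAK_POINTS[c.kozak]
            + WEIGHTS["cons"] * c.conservation
            + WEIGHTS["plddt"] * c.mean_plddt / 100.0)


def rank(candidates):
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(candidates, key=score, reverse=True)
```

Keeping the weights in one dictionary makes it easy to rerun the ranking with a term switched off, which is one way to test how much each evidence type contributes.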

Common Pitfalls

Treating every ribosome-profiling peak as proof of a real protein, which can confuse noise with translation.
Using lncRNA annotations that include transcripts already known or reclassified as coding, which can make a known protein look like a new discovery.
Ignoring transcript isoforms, which can place the same footprint signal on the wrong exon or reading frame.
Ranking candidates by AlphaFold confidence alone, which can reward a confident predicted fold even when there is no evidence the peptide is translated.
Forgetting to compare against random or matched control sequences, which makes your pipeline look better than it really is.
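One way to guard against the first and last pitfalls is to check triplet periodicity: ribosomes that are really translating move three nucleotides at a time, so their footprints tend to pile up in one reading frame, while background signal spreads roughly evenly across all three. The sketch below assumes you already have footprint positions assigned to the ribosomal P-site in transcript coordinates, which is a processing step that depends on the dataset. The function name `frame_fractions` is a hypothetical helper.

```python
def frame_fractions(psite_positions, orf_start, orf_end):
    """Return the share of footprints in frames 0, 1, and 2 of an ORF.

    psite_positions are 0-based transcript coordinates, one per read.
    orf_start is the first base of the start codon and orf_end is
    end-exclusive. Frame 0 is the ORF's own reading frame.
    Returns None when no footprint falls inside the ORF.
    """
    counts = [0, 0, 0]
    for pos in psite_positions:
        if orf_start <= pos < orf_end:
            counts[(pos - orf_start) % 3] += 1
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]
```

Run the same function on your matched control regions. If controls show frame-0 fractions as high as your candidates do, the signal is not telling you much.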

What Makes This Competitive

A stronger project would go beyond a simple candidate list. You would define a clear scoring model, test it on known translated smORFs, and compare it against matched controls. You could also ask whether adding structure confidence improves ranking more than sequence conservation alone. That kind of layered analysis feels much closer to real genome annotation work.
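To test a scoring model on known translated smORFs versus matched controls, one standard summary is the area under the ROC curve (AUC): the probability that a randomly chosen known smORF outscores a randomly chosen control. The sketch below computes it directly from that definition. The function name `roc_auc` is a hypothetical helper, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(positive_scores, negative_scores):
    """Probability that a positive outscores a negative (ties count half).

    positive_scores: model scores for known translated smORFs.
    negative_scores: model scores for matched control regions.
    Returns 1.0 for perfect separation and about 0.5 for a model
    that does no better than chance.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

Computing the AUC once with the structure-confidence term in your score and once without it gives a direct answer to whether structure prediction improves the ranking.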

Project Variations

Use human disease-linked lncRNAs instead of all lncRNAs to see whether candidate micropeptides cluster in medically relevant genes.
Focus on conserved smORFs across vertebrates to test whether evolutionary support improves prioritization.
Compare ribosome-profiling evidence with proteomics databases to see which candidates also have peptide-level support.

Learn More

PubMed: Search review articles on lncRNA translation, micropeptides, and ribosome profiling to build background knowledge.
NCBI GEO: Find ribosome-profiling datasets you can reanalyze from public studies.
GWIPS-viz: Explore genome-wide ribosome profiling tracks and learn how translation signals appear in different transcripts.
NIH NCBI Bookshelf: Read free textbook chapters on gene expression, translation, and RNA biology.
MIT OpenCourseWare: Search for free molecular biology and bioinformatics course materials that explain sequence analysis and genome annotation.
AlphaFold Protein Structure Database: Review example protein predictions and learn how structure confidence is reported.


