Lupus Transcriptome Age-of-Onset Classifier

ISEF Category: Biomedical and Health Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genetics and Molecular Biology of Disease · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Two patients can have the same diagnosis and still show different gene activity patterns. In lupus, age of onset often lines up with different immune signals, so a computer model may be able to tell pediatric cases from adult cases from blood transcriptomes alone. That gives you a clean yes-or-no question with public data and a clear biological story.

What Is It?

A transcriptome is a snapshot of which genes are turned up or down in a sample. Microarray and RNA-seq are two ways to measure that snapshot. You can build a classifier, which is a model that sorts samples into groups, and ask whether it can tell pediatric-onset lupus from adult-onset lupus.

SHAP stands for SHapley Additive exPlanations. Think of it like a scorecard for each gene. For every sample, it shows which genes pushed the prediction toward the pediatric or adult group and by how much, so you can explain the model instead of just trusting a black box.
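
To make the scorecard idea concrete, here is a minimal sketch that computes exact Shapley values by hand for a toy three-gene scoring function. The gene names, weights, and expression values are invented for illustration, and the SHAP library relies on faster, model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample, enumerating every feature subset.

    Features outside the subset are set to their baseline value.
    Only practical for a handful of features.
    """
    n = len(sample)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [sample[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = sample[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical toy model: a weighted sum of three made-up genes.
GENES = ["GENE_A", "GENE_B", "GENE_C"]
WEIGHTS = [2.0, -1.0, 0.5]


def toy_score(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


baseline = [1.0, 1.0, 1.0]   # invented "average" expression
sample = [3.0, 2.0, 1.0]     # invented expression for one patient
phi = shapley_values(toy_score, sample, baseline)

for gene, value in zip(GENES, phi):
    print(f"{gene}: {value:+.2f}")
```

The per-gene values add up to the difference between the sample's score and the baseline score, which is the property that lets SHAP split one prediction into gene-level contributions.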

Why This Is a Good Topic

This is a strong science fair topic because the data are public, the question is narrow, and the results are measurable: accuracy and AUC tell you how well the model performs, and SHAP scores tell you which genes drive its predictions. Lupus age of onset connects to real disease biology, so your work can speak to a real medical problem, not just a coding exercise. You can learn data cleaning, cross-validation, batch correction, and model interpretation without needing a wet lab.

Research Questions

How does feature selection by variance threshold change classifier accuracy for pediatric-onset versus adult-onset lupus?
What is the effect of training on microarray data alone versus RNA-seq data alone on cross-platform performance?
Does adding batch correction improve the model's ability to generalize across GEO studies?
To what extent do linear models and tree-based models differ in AUC for age-of-onset classification?
Which genes remain top SHAP drivers across multiple resampling runs?
How does restricting features to immune pathway genes change interpretability and accuracy?
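
As one way to approach the variance-threshold question above, the sketch below compares cross-validated AUC at three thresholds. It assumes scikit-learn and NumPy are installed. The data are synthetic, and the thresholds, sample counts, and signal strength are arbitrary choices for illustration, not values from any lupus dataset.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_genes = 120, 50

# Synthetic labels: 0 = adult-onset, 1 = pediatric-onset (assumed coding).
y = np.repeat([0, 1], n_samples // 2)

# Each synthetic gene gets its own spread, from nearly flat to highly variable.
gene_sd = np.linspace(0.1, 2.0, n_genes)
X = rng.normal(0.0, gene_sd, size=(n_samples, n_genes))
X[y == 1, -5:] += 1.5  # plant a signal in the five most variable genes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for threshold in (0.0, 0.5, 2.0):
    # Count how many genes survive the filter on the full table (for reporting only).
    kept = int(VarianceThreshold(threshold=threshold).fit(X).get_support().sum())
    # Inside the pipeline the filter is refit on each training fold,
    # so the test fold never influences which genes are kept.
    model = make_pipeline(
        VarianceThreshold(threshold=threshold),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    results[threshold] = (kept, float(auc))

for threshold, (kept, auc) in results.items():
    print(f"threshold={threshold}: {kept} genes kept, mean AUC={auc:.3f}")
```

Putting the filter inside the pipeline is the design choice that matters here: selecting features on the full table before splitting would leak test information into training.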

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Stable internet connection.
Free access to NCBI GEO public datasets.
Python environment with Jupyter Notebook or Google Colab.
Spreadsheet software for sample labels, metadata, and analysis notes.
External storage or cloud storage for downloaded matrices and results.

Advanced Materials

Workstation or server with 32 GB RAM or more.
R and Python environments on a research machine.
High-capacity storage for repeated model runs and saved outputs.
Institutional access to controlled transcriptomic data, if available through a mentor or lab.
Git-based repository for version control and analysis logs.

Software & Tools

Python: Cleans the tables, trains classifiers, and builds evaluation plots.
R/Bioconductor: Imports GEO data, checks annotations, and helps with expression preprocessing.
Jupyter Notebook: Keeps code, notes, and figures in one place.
scikit-learn: Trains baseline and comparison models with cross-validation.
SHAP: Ranks genes by how much they push each prediction.

Experiment Steps

Define the exact age-of-onset label rule and decide whether you will pool platforms or analyze them separately.
Clean the public datasets, align gene identifiers, and plan how you will handle missing values and batch effects.
Choose one baseline classifier, then compare it with one or two stronger models using the same split strategy.
Build a validation plan that keeps studies or patients separated so the model cannot memorize source effects.
Run SHAP on the best model and check whether the top genes stay similar across resamples, studies, or platforms.
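
The validation step above can be sketched with scikit-learn's GroupKFold, which keeps every sample from one study on the same side of each split. This is a minimal example on synthetic data. The study names, the size of the batch shift, and the choice of logistic regression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)

# Four hypothetical studies with 30 samples each; labels alternate within a study
# so every study contains both classes.
studies = np.repeat(["study_A", "study_B", "study_C", "study_D"], 30)
y = np.tile([0, 1], 60)

X = rng.normal(size=(120, 20))
X[y == 1, :3] += 1.0  # modest label signal in three synthetic genes
for offset, name in enumerate(np.unique(studies)):
    X[studies == name] += 0.5 * offset  # study-specific shift, a stand-in for batch effects

cv = GroupKFold(n_splits=4)

# Record any study that appears in both the training and the test side of a fold.
overlaps = []
for train_idx, test_idx in cv.split(X, y, groups=studies):
    overlaps.append(set(studies[train_idx]) & set(studies[test_idx]))

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    groups=studies, cv=cv, scoring="roc_auc",
)
print("shared studies per fold:", overlaps)
print("held-out-study AUC per fold:", np.round(scores, 3))
```

With real GEO data the group label would be the series accession of each sample, and the per-fold AUC shows how well the model transfers to a study it has never seen.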

Common Pitfalls

Mixing samples from the same study across train and test splits, which makes the model look better than it is.
Comparing microarray probes and RNA-seq gene counts without matching gene IDs, which scrambles the feature table.
Ignoring batch effects between GEO studies, which turns lab source into the easiest signal.
Treating SHAP rankings from one random split as final, which hides unstable gene drivers.
Using far more features than samples, which lets the classifier memorize noise instead of age-of-onset biology.
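
To illustrate the gene-ID pitfall, the sketch below collapses microarray probes to gene symbols and keeps only the genes shared with an RNA-seq table. It assumes pandas is installed. The probe names, the probe-to-gene mapping, and every expression value are made up; a real mapping would come from the platform annotation that accompanies each GEO dataset.

```python
import pandas as pd

# Toy probe-level microarray table: rows are samples, columns are probes.
microarray = pd.DataFrame(
    {"probe_1": [5.0, 6.0], "probe_2": [7.0, 8.0], "probe_3": [1.0, 2.0]},
    index=["array_s1", "array_s2"],
)

# Hypothetical annotation: two probes measure the same gene.
probe_to_gene = {"probe_1": "IFI44L", "probe_2": "IFI44L", "probe_3": "CD19"}

# Toy gene-level RNA-seq table with one gene the array does not cover.
rnaseq = pd.DataFrame(
    {"IFI44L": [3.2, 4.1], "CD19": [2.0, 1.5], "MX1": [6.0, 5.5]},
    index=["rnaseq_s1", "rnaseq_s2"],
)

# Rename probes to gene symbols, then average probes that map to the same gene.
gene_level = (
    microarray.rename(columns=probe_to_gene)
    .T.groupby(level=0)
    .mean()
    .T
)

# Keep only genes present on both platforms, in one fixed column order.
shared = sorted(set(gene_level.columns) & set(rnaseq.columns))
combined = pd.concat([gene_level[shared], rnaseq[shared]])

print(shared)
print(combined)
```

Averaging duplicate probes is only one possible rule; keeping the most variable probe per gene is another common choice. The two platforms also sit on different scales, so stacking them like this still leaves the batch-effect pitfall to handle separately.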

What Makes This Competitive

A stronger version does more than report one accuracy score. It tests whether the signal survives across studies, platforms, and resamples, then checks if the same genes keep rising to the top. If you compare multiple classifiers, handle batch effects carefully, and show stable SHAP results, your project starts to look like real research instead of a demo.
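
One way to check whether the same genes keep rising to the top is to rerun the ranking on many resamples and count how often each gene lands in the top k. The sketch below does this on synthetic data with NumPy only. As a loudly labeled stand-in, it ranks genes by the absolute difference in class means rather than by SHAP; in a real project the importance function would return something like the mean absolute SHAP value per gene from your fitted model.

```python
import numpy as np


def mean_difference_importance(X, y):
    """Stand-in importance score: absolute difference in class means per gene."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))


def top_k_frequency(X, y, importance_fn, k=3, n_resamples=50, seed=0):
    """Fraction of bootstrap resamples in which each gene ranks in the top k."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    class_indices = [np.flatnonzero(y == label) for label in np.unique(y)]
    for _ in range(n_resamples):
        # Resample with replacement within each class so both classes stay present.
        idx = np.concatenate(
            [rng.choice(ci, size=ci.size, replace=True) for ci in class_indices]
        )
        scores = importance_fn(X[idx], y[idx])
        counts[np.argsort(scores)[-k:]] += 1
    return counts / n_resamples


# Synthetic data: 100 samples, 30 genes, signal planted in the first three genes.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 30))
X[y == 1, :3] += 2.0

freq = top_k_frequency(X, y, mean_difference_importance, k=3, n_resamples=50)
for gene in np.argsort(freq)[::-1][:5]:
    print(f"gene_{gene}: in top 3 for {freq[gene]:.0%} of resamples")
```

A gene that reaches the top in nearly every resample is a more defensible finding than one that appears in a single split, and the same frequency table can be built across studies or platforms instead of bootstrap draws.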

Project Variations

Which genes separate pediatric-onset and adult-onset lupus when you analyze only whole-blood RNA-seq datasets?
How does classifier performance change when you use microarray data from peripheral blood mononuclear cells instead of whole blood?
Does a pathway-level model based on interferon or B-cell genes give clearer SHAP explanations than a gene-by-gene model?
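
For the pathway-level variation, one simple option is to replace individual genes with a per-sample pathway score, computed as the average z-score of the genes in each set. The sketch below assumes NumPy is installed and uses invented gene names, gene sets, and expression values. Real sets could come from a curated source such as MSigDB.

```python
import numpy as np


def pathway_scores(expr, gene_names, gene_sets):
    """Per-sample pathway score: mean z-score over the set's genes found in the data."""
    expr = np.asarray(expr, dtype=float)
    std = expr.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for flat genes
    z = (expr - expr.mean(axis=0)) / std
    column = {gene: i for i, gene in enumerate(gene_names)}
    scores = {}
    for name, genes in gene_sets.items():
        cols = [column[g] for g in genes if g in column]  # skip genes missing from the table
        if cols:
            scores[name] = z[:, cols].mean(axis=1)
    return scores


# Invented toy table: 4 samples (rows) by 4 genes (columns).
gene_names = ["G1", "G2", "G3", "G4"]
expr = [
    [10, 9, 1, 2],
    [2, 1, 1, 2],
    [1, 2, 8, 9],
    [2, 1, 2, 1],
]

# Hypothetical gene sets; "G_missing" is not in the table and is ignored.
gene_sets = {
    "interferon_like": ["G1", "G2"],
    "bcell_like": ["G3", "G4", "G_missing"],
}

scores = pathway_scores(expr, gene_names, gene_sets)
for name, values in scores.items():
    print(name, np.round(values, 2))
```

The resulting score columns can be fed to the same classifier and SHAP workflow as gene-level features, which gives one SHAP value per pathway instead of one per gene.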

Learn More

NCBI Gene Expression Omnibus: Search for lupus transcriptomic studies and download public expression matrices.
PubMed: Search review articles on pediatric lupus transcriptomics, age of onset, and interferon signatures.
NCBI Gene: Look up gene summaries and related pathways for top SHAP genes.
MIT OpenCourseWare: Review free machine learning lectures on classification, overfitting, and validation.
scikit-learn User Guide: Read the free documentation for model training, cross-validation, and metrics.
Broad Institute MSigDB: Find curated gene sets for optional pathway checks and enrichment analysis.

