Multi-Omics VAE for Metabolic Stress

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

Your cells leave clues in two places at once: RNA and metabolites. RNA shows which genes are active, while metabolites are the small molecules that cell chemistry produces and consumes. If you connect both, you may find a hidden pattern that single-data studies miss. That is the core idea behind multi-omics analysis.

What Is It?

Multi-omics means studying more than one kind of biological data from the same system. In this project, you look at transcriptome data, which is a snapshot of gene activity, and metabolome data, which is a snapshot of the small molecules present in cells. If the genome is the cell's full recipe book, think of the transcriptome as the recipes currently open on the counter, and the metabolome as the ingredients and finished dishes. One shows what the cell plans to do, and the other shows what it actually produced.

A variational autoencoder, or VAE, is a machine learning model that compresses complex data into a smaller set of hidden features called latent variables. You can think of it like reducing a huge map to a few important landmarks. If your model learns a shared latent axis linked to metabolic stress, then samples from different cohorts may cluster by biology instead of by dataset source. That is useful when you want to compare conditions like nonalcoholic fatty liver disease (NAFLD), type 2 diabetes (T2D), and aging across public databases.
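To make the idea more concrete, here is a minimal sketch of the two ingredients that set a VAE apart from a plain autoencoder, in one common formulation: sampling the latent variables with the reparameterization trick, and adding a KL penalty that pulls the latent distribution toward a standard normal. The function names, the `beta` weight, and the example numbers are assumptions for illustration. A real project would use PyTorch tensors and a trained encoder and decoder rather than plain Python lists.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus a beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * kl_to_standard_normal(mu, log_var)

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder outputs for one sample
z = reparameterize(mu, log_var, rng)     # one latent sample with two dimensions
print(len(z), kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when the encoder outputs a mean of 0 and a variance of 1, so it acts as a regularizer. Raising `beta` trades reconstruction accuracy for a smoother latent space, which is one of the knobs you could vary in a sensitivity check.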

Why This Is a Good Topic

This is a strong science fair topic because the question is testable, data-rich, and open-ended. Public datasets from GEO and MetaboLights let you work without collecting human samples, which keeps the project realistic for a student. You can ask whether a VAE finds shared biological structure across diseases, and you can test that with clear model outputs, clustering, and statistics. You will also learn data cleaning, feature scaling, model evaluation, and how to compare signals across different omics layers.

Research Questions

How does combining transcriptome and metabolome data change the separation of NAFLD, T2D, and aging samples compared with using either data type alone?
How does using a variational autoencoder, compared with a linear baseline such as PCA, affect the ability to detect a shared latent metabolic-stress axis across cohorts?
Does adding matched metabolome features improve classification of disease status versus age group in public datasets?
To what extent do latent variables learned from one cohort transfer to another cohort from a different study?
Which preprocessing choice, such as log scaling or batch correction, gives the most stable latent structure across datasets?
What is the effect of removing low-variance features on the clarity of cohort clustering in the latent space?
To what extent do transcriptome and metabolome features point to the same stress-related pathways in downstream enrichment analysis?
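Two of the questions above turn on preprocessing choices, so it helps to see what those choices look like in code. The sketch below applies a log2(x + 1) transform, drops low-variance features, and z-scores what remains. The tiny matrix, the feature names, and the `min_var` threshold are made up for illustration. In practice you would likely do the same steps with pandas and scikit-learn on a full expression matrix.

```python
import math
import statistics

def log2_transform(matrix):
    """Apply log2(x + 1) to every value; rows are samples, columns are features."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]

def drop_low_variance(matrix, names, min_var):
    """Keep only features whose variance across samples exceeds min_var."""
    columns = list(zip(*matrix))
    keep = [j for j, col in enumerate(columns) if statistics.pvariance(col) > min_var]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]

def zscore_columns(matrix):
    """Center each feature to mean 0 and scale it to unit standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]

# Made-up counts: 4 samples x 3 features (not real GEO data).
counts = [[0, 15, 7], [3, 15, 63], [1, 15, 255], [7, 15, 1023]]
names = ["geneA", "geneB", "geneC"]

logged = log2_transform(counts)
filtered, kept = drop_low_variance(logged, names, min_var=0.01)
scaled = zscore_columns(filtered)
print(kept)  # geneB is constant across samples, so it is dropped
```

To answer the stability question, you could rerun the whole pipeline with and without each step and compare the resulting latent spaces, changing one choice at a time.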

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Stable internet access for downloading public datasets.
External hard drive or cloud storage for raw and processed files.
Spreadsheet software for tracking samples, cohorts, and metadata.
Python installed with Jupyter Notebook.
Public datasets from GEO and MetaboLights.
Basic statistics reference sheet or notes from an introductory online course.
Text editor for cleaning metadata and sample labels.

Advanced Materials

Desktop workstation or university cluster access; a GPU is optional but helpful.
Python environment with PyTorch, scikit-learn, pandas, numpy, and scanpy.
R with Bioconductor packages for omics preprocessing and enrichment analysis.
JupyterLab or VS Code for reproducible notebooks.
ImageJ or similar image analysis software, only if you add an image-based validation step.
Access to pathway databases such as KEGG or Reactome through public tools.
Public transcriptome and metabolome cohorts with matched or near-matched samples.
Version control repository for code and analysis logs.

Software & Tools

Python: Cleans data, trains the variational autoencoder, and runs evaluation scripts.
Jupyter Notebook: Keeps code, notes, plots, and model checks in one place.
RStudio: Helps with statistical tests, batch correction, and enrichment analysis in R.
GEOquery (R/Bioconductor package): Pulls transcriptome metadata and expression matrices from GEO.
MetaboAnalyst: Supports metabolomics preprocessing, normalization, and pathway exploration.

Experiment Steps

Define the exact biological question you want the model to answer, such as shared stress structure across disease and aging cohorts.
Choose which public datasets qualify as comparable, then inspect their metadata for sample type, platform, and pairing quality.
Decide how you will clean, normalize, and align transcriptome and metabolome features before modeling.
Build a baseline model first, then compare it with the VAE so you can test whether the autoencoder adds value.
Plan how you will test whether the latent space reflects biology, not just study source or batch effects.
Select downstream analyses that turn hidden variables into interpretable results, such as clustering, pathway enrichment, and cohort transfer tests.
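The alignment step in the list above is easy to get wrong, so here is a minimal sketch of one way to do it, assuming each omics layer is stored as a mapping from sample ID to an already-scaled feature vector. The sample IDs, values, and the function name `align_paired_samples` are hypothetical. Simple concatenation is only one possible integration strategy, and real GEO and MetaboLights metadata will usually need manual matching before IDs line up this cleanly.

```python
def align_paired_samples(rna, metab):
    """Keep samples present in both layers and concatenate their feature vectors.

    rna and metab each map a sample ID to a list of already-scaled feature values.
    Returns the joined vectors plus the IDs that had to be dropped.
    """
    shared = sorted(set(rna) & set(metab))
    joined = {sid: rna[sid] + metab[sid] for sid in shared}
    dropped = sorted(set(rna) ^ set(metab))  # present in only one layer
    return joined, dropped

# Hypothetical sample IDs and values, for illustration only.
rna = {"S1": [0.2, -1.1], "S2": [1.4, 0.3], "S3": [-0.7, 0.9]}
metab = {"S2": [0.5], "S3": [-0.2], "S4": [1.8]}

joined, dropped = align_paired_samples(rna, metab)
print(joined)   # transcriptome features first, then metabolome features
print(dropped)  # samples without a partner in the other layer
```

Logging the dropped IDs matters: if most of a cohort disappears at this step, that cohort may not have enough paired samples to support the analysis.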

Common Pitfalls

Mixing datasets with different sample types, which can make the model learn tissue differences instead of metabolic stress.
Skipping batch correction, which often causes the latent space to cluster by study source rather than biology.
Using too few paired samples, which makes the VAE unstable and hard to interpret.
Treating all omics features as equally reliable, which can let noisy metabolites or low-quality genes dominate the model.
Failing to keep the train and test cohorts separate, which inflates performance and hides poor generalization.
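One way to avoid the last pitfall is to split by cohort instead of by sample, so that every sample from a held-out study stays out of training. The sketch below shows a leave-one-cohort-out loop under that assumption. The sample IDs and study labels are hypothetical, and scikit-learn's `LeaveOneGroupOut` offers the same idea for array data.

```python
def leave_one_cohort_out(sample_to_cohort):
    """Yield (held_out_cohort, train_ids, test_ids) with no cohort in both sets."""
    cohorts = sorted(set(sample_to_cohort.values()))
    for held_out in cohorts:
        train = sorted(s for s, c in sample_to_cohort.items() if c != held_out)
        test = sorted(s for s, c in sample_to_cohort.items() if c == held_out)
        yield held_out, train, test

# Hypothetical cohort labels; real ones would come from study accession numbers.
sample_to_cohort = {"S1": "studyA", "S2": "studyA", "S3": "studyB", "S4": "studyC"}

splits = list(leave_one_cohort_out(sample_to_cohort))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Remember that any scaling or feature selection should be fitted on the training cohorts only and then applied to the held-out cohort. Otherwise information still leaks across the split.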

What Makes This Competitive

A strong project here does more than run a model. You need careful cohort matching, clear baseline comparisons, and a real test of whether the latent axis transfers across studies. Strong entries also separate biology from batch effects and show why the learned features mean something in disease terms. If you add pathway analysis, transfer testing, and a sensitivity check across preprocessing choices, your project starts to look much more like serious research.
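A simple way to check whether a latent space reflects biology or batch is to ask, for each sample, whether its nearest neighbor shares its condition label or its study label. The sketch below is one rough diagnostic along those lines, not a standard named metric. The coordinates and labels are made up. A fuller analysis might use silhouette scores or a dedicated batch-mixing measure.

```python
import math

def nearest_neighbor_agreement(points, labels):
    """Fraction of samples whose nearest neighbor (Euclidean) has the same label."""
    hits = 0
    for i, p in enumerate(points):
        best_j = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        hits += labels[i] == labels[best_j]
    return hits / len(points)

# Made-up 2D latent coordinates for four samples.
latent = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
condition = ["control", "control", "disease", "disease"]
study = ["A", "B", "A", "B"]

print(nearest_neighbor_agreement(latent, condition))  # high: neighbors share biology
print(nearest_neighbor_agreement(latent, study))      # low: studies are well mixed
```

In this toy case the latent space is the kind you are hoping for: agreement is high for condition and low for study. The opposite pattern would suggest the model has mostly learned dataset source.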

Project Variations

Use only NAFLD and T2D cohorts to test whether the same latent axis appears in two related metabolic diseases.
Swap the VAE for a simpler autoencoder or PCA so you can compare whether nonlinear compression adds real value.
Add pathway-level features instead of raw omics features to test whether the latent axis becomes easier to interpret.
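For the pathway-level variation, one simple approach is to average the scaled values of each pathway's member genes, giving one score per pathway per sample. The sketch below assumes that approach. The pathway and gene names are placeholders, not real KEGG or Reactome content, and established methods such as GSVA or ssGSEA score pathways in more sophisticated ways.

```python
import statistics

def pathway_scores(expression, gene_sets):
    """Average the scaled values of each pathway's member genes for one sample.

    Genes missing from the sample are skipped; pathways with no measured
    members are left out of the result.
    """
    scores = {}
    for pathway, genes in gene_sets.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            scores[pathway] = statistics.fmean(values)
    return scores

# Placeholder names for illustration only.
gene_sets = {
    "pathway_1": ["geneA", "geneB"],
    "pathway_2": ["geneC", "geneX"],   # geneX is not measured in this sample
    "pathway_3": ["geneY"],            # no measured members, so no score
}
sample = {"geneA": 1.0, "geneB": 3.0, "geneC": -0.5}

print(pathway_scores(sample, gene_sets))
```

Feeding these scores into the model shrinks the input from thousands of genes to a much smaller set of pathways, which may make each latent axis easier to name but can also hide signal from genes outside the chosen gene sets.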

Learn More

NCBI GEO: Search for transcriptome datasets, metadata, and paired studies relevant to disease and aging.
MetaboLights: Find public metabolomics datasets and sample annotations for multi-omics comparison.
NIH PubMed: Search review articles on multi-omics integration, variational autoencoders, NAFLD, T2D, and aging.
MIT OpenCourseWare Computational Biology: Use free lectures to review machine learning, biological data analysis, and model evaluation.
Bioconductor: Read package vignettes for transcriptomics preprocessing, normalization, and enrichment analysis.
Nature Methods and PLOS Computational Biology: Search for peer-reviewed examples of multi-omics integration and latent-variable modeling.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
