Urine Biomarker Discovery for Diabetes

ISEF Category: Biochemistry

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Analytical Biochemistry · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months

The Hook

A urine sample may carry a diabetes signal before blood sugar looks clearly abnormal. That makes urine metabolomics a good place to hunt for early clues. If you can shrink thousands of measured molecules down to five or fewer that still predict disease, you get a model that is easier to explain and easier to test.

What Is It?

Metabolomics measures the small molecules, called metabolites, that your body's chemistry leaves behind. Think of each metabolite as a footprint from metabolism, the chemical work your cells do all day. In early type-2 diabetes, some of those footprints shift because the body handles sugar, fat, and energy differently.

Your project uses public studies from MetaboLights and asks which few urine metabolites best separate early type-2 diabetes samples from control samples. Sparse logistic regression is a classification model with a penalty (the L1 or LASSO penalty) that keeps only a small set of useful predictors and shrinks the coefficients of the rest to exactly zero, like trimming a huge keyring down to the keys that actually open the door. That makes the result easier to read and easier to compare across studies.
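To see what "sparse" means in practice, here is a minimal sketch in Python with scikit-learn. It runs on made-up data, not on MetaboLights files: 50 simulated metabolites, of which only the first three carry signal. The penalty strength C=0.05 and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simulated data (an assumption, not real urine metabolomics data).
rng = np.random.default_rng(0)
n_samples, n_metabolites = 200, 50
X = rng.normal(size=(n_samples, n_metabolites))

# Only metabolites 0, 1, and 2 influence the simulated diagnosis.
signal = 1.5 * X[:, 0] - 1.2 * X[:, 1] + 1.0 * X[:, 2]
y = (signal + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

# Put every metabolite on the same scale so the penalty treats them equally.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty sets many coefficients to exactly zero.
# A smaller C means a stronger penalty and fewer metabolites kept.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=0)
model.fit(X_scaled, y)

kept = np.flatnonzero(model.coef_[0])
print("Metabolites kept:", kept.tolist())
print("Their coefficients:", np.round(model.coef_[0][kept], 2).tolist())
```

Try a few values of C and watch the list of kept metabolites grow or shrink. That one dial is what lets you hold the panel to five or fewer.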

Why This Is a Good Topic

This is a strong science fair topic because you can test a clear question with public data, then measure how well your answer holds up on new samples. It connects to a real health problem, early diabetes screening, and it lets you learn data cleaning, feature selection, model validation, and biomarker thinking. You can do real research without a wet lab, which keeps the project realistic for a home setup even though the statistics make it an advanced one.

Research Questions

Which urine metabolites stay predictive after you limit the model to five or fewer features?
How does the biomarker panel change when you train on one study and test on a different study?
How do different normalization choices affect panel stability and model accuracy?
Does adding age, sex, or body mass index improve prediction beyond metabolites alone?
To what extent do selected metabolites remain stable across cross-validation folds?
How does class balance affect the false positive rate and the final biomarker list?
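The normalization question above is easier to picture with a small example. Urine can be more or less dilute from sample to sample, so raw intensities are not directly comparable. The sketch below shows two common options, total-sum normalization and a simplified probabilistic quotient normalization (PQN), on made-up numbers. The function names are hypothetical, and the code assumes every intensity is positive.

```python
import numpy as np

def total_sum_normalize(X):
    """Divide each sample (row) by its total intensity."""
    return X / X.sum(axis=1, keepdims=True)

def pqn_normalize(X):
    """Simplified PQN: scale each sample by its median ratio to a reference.

    The reference is the median profile across all samples.
    Assumes all intensities are positive.
    """
    reference = np.median(X, axis=0)
    quotients = X / reference
    dilution = np.median(quotients, axis=1, keepdims=True)
    return X / dilution

# Made-up example: one underlying profile measured at four dilutions.
rng = np.random.default_rng(1)
profile = rng.uniform(1.0, 10.0, size=20)
dilutions = np.array([0.5, 1.0, 2.0, 4.0])
X = dilutions[:, None] * profile

X_tsn = total_sum_normalize(X)
X_pqn = pqn_normalize(X)

# After either method the four rows should match, because the only
# difference between them was dilution.
print("Raw rows identical:", np.allclose(X[0], X[3]))
print("TSN rows identical:", np.allclose(X_tsn[0], X_tsn[3]))
print("PQN rows identical:", np.allclose(X_pqn[0], X_pqn[3]))

# A log transform is a common next step for skewed intensity data.
X_log = np.log2(X_pqn)
```

On real data the two methods will not agree this neatly, and that disagreement is exactly what the research question asks you to measure.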

Basic Materials

Laptop with internet access.
Spreadsheet software or Google Sheets.
R or Python installed.
Access to MetaboLights study files and sample metadata.
PubMed access for background reading.

Advanced Materials

University workstation with enough memory for larger metabolomics tables.
R with glmnet, tidymodels, and Bioconductor packages.
Python with pandas, scikit-learn, and Jupyter Notebook.
Access to raw LC-MS feature tables or mzML-derived matrices.
Statistical plotting software for ROC and calibration curves.

Software & Tools

R: Fits sparse logistic regression with glmnet and tunes the penalty strength by cross-validation.
Python: Cleans metadata, merges tables, and runs comparison models.
Jupyter Notebook: Keeps code, notes, and figures in one place.
MetaboAnalyst: Helps compare metabolite patterns and visualize separation.
GitHub: Tracks analysis changes and keeps the project reproducible.

Experiment Steps

Define the exact diabetes and control groups you will compare, and set your inclusion rules before you touch the model.
Gather compatible MetaboLights studies, then map metabolite names to shared identifiers and merge the studies into one clean table.
Decide your normalization and filtering rules, then lock them before model training starts.
Build a sparse logistic regression workflow with cross-validation and a cap of five metabolites.
Test the panel on held-out samples or a second study, then compare accuracy, sensitivity, and specificity.
Check whether the selected metabolites stay similar across resamples, because stable features matter more than one lucky split.
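The modeling steps above can be sketched in Python with scikit-learn. This is one possible workflow under stated assumptions, not the only way to do it: the data are simulated, the penalty grid C_GRID is an arbitrary choice, and the helper names fit_capped_panel and cross_validate_panel are hypothetical. The cap is enforced by picking the weakest penalty on the grid that still keeps five or fewer metabolites, and scaling plus selection happen inside each training fold so the test fold stays untouched.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

MAX_FEATURES = 5
C_GRID = np.logspace(-2, 0, 15)  # ascending: strong penalty to weak penalty

def fit_capped_panel(X, y):
    """Fit L1 models along C_GRID; keep the last one with 1 to 5 features."""
    best = None
    for C in C_GRID:
        pipe = make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0),
        )
        pipe.fit(X, y)
        n_kept = np.count_nonzero(pipe[-1].coef_)
        if n_kept > MAX_FEATURES:
            break
        if n_kept >= 1:
            best = pipe
    if best is None:
        raise ValueError("No penalty on the grid kept between 1 and 5 features")
    return best

def cross_validate_panel(X, y, n_splits=5):
    """Score the capped panel on held-out folds and count selections."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    rows, selection_counts = [], Counter()
    for train_idx, test_idx in cv.split(X, y):
        pipe = fit_capped_panel(X[train_idx], y[train_idx])
        prob = pipe.predict_proba(X[test_idx])[:, 1]
        pred = pipe.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        panel = np.flatnonzero(pipe[-1].coef_[0])
        selection_counts.update(panel.tolist())
        rows.append({
            "auc": roc_auc_score(y[test_idx], prob),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "panel_size": len(panel),
        })
    return rows, selection_counts

# Simulated data (an assumption): only metabolites 0, 1, and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
signal = 1.5 * X[:, 0] - 1.2 * X[:, 1] + 1.0 * X[:, 2]
y = (signal + rng.normal(scale=0.5, size=300) > 0).astype(int)

rows, selection_counts = cross_validate_panel(X, y)
for fold, row in enumerate(rows, start=1):
    print(fold, {k: round(float(v), 3) for k, v in row.items()})
print("Times each metabolite was selected:", dict(selection_counts))
```

A metabolite that shows up in every fold is a better biomarker candidate than one that appears once. With real data you would replace X and y with your cleaned feature table and diagnosis labels.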

Common Pitfalls

Mixing studies with different sample handling rules, which can make the model learn batch effects instead of diabetes signals.
Comparing metabolite names without matching IDs first, which can split one chemical into several fake features.
Filtering out too many low-abundance metabolites, which can erase the very signal you wanted to find.
Training and testing on the same samples, which inflates accuracy and hides overfitting.
Ignoring class imbalance, which can make a weak model look good by favoring the larger group.
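The name-matching pitfall is the easiest one to guard against in code. Below is a minimal pandas sketch that maps metabolite names from two studies to shared identifiers before merging. Everything in it is made up for illustration: the two tiny tables, the identifiers such as MET_001, and the helper name harmonize. In a real project the lookup table would come from a curated database rather than a hand-typed dictionary.

```python
import pandas as pd

# Hypothetical lookup from cleaned names to placeholder identifiers.
NAME_TO_ID = {
    "glucose": "MET_001",
    "d-glucose": "MET_001",
    "citrate": "MET_002",
    "citric acid": "MET_002",
    "hippurate": "MET_003",
}

def harmonize(table, name_to_id):
    """Rename metabolite columns to shared IDs and report unmatched names."""
    cleaned = table.columns.str.strip().str.lower()
    ids = cleaned.map(name_to_id)
    matched = ids.notna()
    unmatched = table.columns[~matched].tolist()
    out = table.loc[:, matched].copy()
    out.columns = ids[matched]
    # If two columns map to the same ID, add them together (a design choice).
    out = out.T.groupby(level=0).sum().T
    return out, unmatched

# Made-up intensity tables: rows are samples, columns are metabolite names.
study_a = pd.DataFrame(
    {"D-Glucose": [5.1, 7.9], "Citrate": [2.0, 1.4], "Unknown_123": [0.3, 0.2]},
    index=["A1", "A2"],
)
study_b = pd.DataFrame(
    {"glucose": [4.8, 8.3], "citric acid ": [2.2, 1.1], "Hippurate": [3.0, 2.5]},
    index=["B1", "B2"],
)

clean_a, unmatched_a = harmonize(study_a, NAME_TO_ID)
clean_b, unmatched_b = harmonize(study_b, NAME_TO_ID)

# Keep only metabolites measured in both studies, and label each row's study
# so you can check for batch effects later.
merged = pd.concat(
    {"study_A": clean_a, "study_B": clean_b}, join="inner", names=["study", "sample"]
)
print(merged)
print("Unmatched in study A:", unmatched_a)
```

Always inspect the unmatched list by hand. A long list usually means the lookup table is incomplete, not that the metabolites are unimportant.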

What Makes This Competitive

A stronger version of this project does more than find a short list of metabolites. You would show that the panel survives different normalization choices, different train-test splits, and at least one held-out study. If you also report calibration, feature stability, and a comparison with a less sparse model, your work looks much closer to real biomarker research.
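As a rough sketch of the calibration check and the comparison with a less sparse model, the code below fits an L1 (sparse) and an L2 (dense) logistic regression on simulated data and reports AUC, Brier score, and a five-bin calibration curve for each. The data, the penalty settings, and the model labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data (an assumption): 30 metabolites, 3 of them informative.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 30))
true_prob = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.2 * X[:, 1] + 1.0 * X[:, 2])))
y = rng.binomial(1, true_prob)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "sparse (L1)": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=0),
    ),
    # The default penalty in scikit-learn's LogisticRegression is L2.
    "dense (L2)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    # calibration_curve returns (observed fraction positive, mean predicted).
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)
    results[name] = {
        "n_features": int(np.count_nonzero(model[-1].coef_)),
        "auc": roc_auc_score(y_te, prob),
        "brier": brier_score_loss(y_te, prob),
        "calibration": list(zip(np.round(mean_pred, 2), np.round(frac_pos, 2))),
    }

for name, res in results.items():
    print(name, res["n_features"], round(res["auc"], 3), round(res["brier"], 3))
```

In a well-calibrated model, the mean predicted probability in each bin sits close to the observed fraction of positives. A strong L1 penalty tends to pull predictions toward 0.5, so it is worth reporting calibration next to accuracy rather than instead of it.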

Project Variations

Compare urine biomarkers across early diabetes, prediabetes, and healthy controls to see whether the same panel still works.
Swap sparse logistic regression for random forest or support vector machines, then compare which model keeps the smallest useful feature set.
Test whether a panel built from one MetaboLights study still predicts samples from another study after you standardize the feature table.
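The cross-study variation can be prototyped before you touch real files. The sketch below simulates two studies where the second has a measurement offset on one metabolite, a stand-in for a batch effect. It then trains on study A and tests on study B twice: once reusing study A's scaling, and once standardizing each study on its own. All data, offsets, and function names are assumptions. Per-study standardization uses no labels, but it does assume the two studies have a similar mix of cases and controls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

N_FEATURES = 25

def simulate_study(n, offset, seed):
    """Simulated study: metabolites 0 and 1 carry signal; offset mimics batch."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, N_FEATURES))
    signal = 1.5 * X[:, 0] - 1.2 * X[:, 1]
    y = (signal + rng.normal(scale=0.7, size=n) > 0).astype(int)
    return X + offset, y

offset_b = np.zeros(N_FEATURES)
offset_b[0] = 4.0  # study B reads metabolite 0 systematically higher

X_a, y_a = simulate_study(200, np.zeros(N_FEATURES), seed=10)
X_b, y_b = simulate_study(150, offset_b, seed=11)

def train_on_a_test_on_b(per_study_scaling):
    """Return accuracy on study B for a model trained only on study A."""
    scaler_a = StandardScaler().fit(X_a)
    scaler_b = StandardScaler().fit(X_b) if per_study_scaling else scaler_a
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, random_state=0)
    model.fit(scaler_a.transform(X_a), y_a)
    pred = model.predict(scaler_b.transform(X_b))
    return accuracy_score(y_b, pred)

acc_shared = train_on_a_test_on_b(per_study_scaling=False)
acc_per_study = train_on_a_test_on_b(per_study_scaling=True)
print("Accuracy with study A scaling:", round(acc_shared, 3))
print("Accuracy with per-study scaling:", round(acc_per_study, 3))
```

Real batch effects are messier than a single offset, so treat this as a way to test your pipeline, not as evidence that standardization will fix every cross-study gap.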

Learn More

MetaboLights: Search public metabolomics studies, download metadata, and compare study designs in the EMBL-EBI archive.
NIH Metabolomics Workbench: Find public urine metabolomics datasets and protocol notes from the NIH resource.
PubMed: Search review articles on urine metabolomics, type-2 diabetes, and biomarker discovery.
glmnet documentation: Read about LASSO and elastic net models for sparse feature selection in R.
scikit-learn documentation: Learn cross-validation, logistic regression, and model scoring in Python.
MIT OpenCourseWare: Search statistics and machine learning course notes for model validation and regularization.
