NAFLD Detection From Bloodwork With Machine Learning
ISEF Category: Biomedical and Health Sciences
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Pathophysiology · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months
The Hook
A routine blood panel can hide more than you think. The same numbers doctors already order may point to fatty liver disease before symptoms show up. That makes this a strong science fair idea, because you can test whether simple bloodwork can help flag risk early. You also get to work with real public health data, not toy examples.
What Is It?
Non-alcoholic fatty liver disease, or NAFLD, happens when extra fat builds up in the liver in people who do not drink enough alcohol to explain it. Since 2023, many papers use the newer name metabolic dysfunction-associated steatotic liver disease (MASLD), so search for both terms. Many people do not notice it early. Blood tests can show clues, like changes in liver enzymes, glucose, lipids, and other markers, but no single value gives the whole answer.
This project asks whether a machine learning model can combine those clues better than one lab value alone. Think of it like a referee reading several signals at once instead of judging by one score. A boosted-tree model is a type of algorithm that builds many small decision rules and combines them into one prediction. If you keep the model interpretable, you can also see which blood markers matter most and how much they matter.
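To make "many small decision rules combined into one prediction" concrete, here is a minimal teaching sketch of boosting with one-split rules (stumps) in plain Python. It assumes squared-error boosting on a 0/1 label, which is simpler than the log-loss objective that xgboost or lightgbm would normally use for classification. The function names, the eight toy rows, and the ALT and triglyceride values are all invented for illustration.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Find the one feature/threshold rule that best fits the current residuals."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= t]
            right = [r for row, r in zip(rows, residuals) if row[j] > t]
            if not left or not right:
                continue  # a rule must send rows to both sides
            lm, rm = mean(left), mean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)

def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def boost(rows, labels, n_rounds=20, learning_rate=0.3):
    """Add one small rule per round, each correcting what is still wrong."""
    base = mean(labels)
    scores = [base] * len(rows)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - s for y, s in zip(labels, scores)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, rows)]
        history.append(sum((y - s) ** 2 for y, s in zip(labels, scores)))
    return base, stumps, history

def predict_score(base, stumps, row, learning_rate=0.3):
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

# Invented toy data: (ALT in U/L, triglycerides in mg/dL), label 1 = fatty liver.
rows = [(18, 90), (22, 110), (25, 210), (30, 95),
        (41, 180), (47, 220), (52, 100), (60, 250)]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

base, stumps, history = boost(rows, labels)
print("training squared error after round 1: ", round(history[0], 4))
print("training squared error after round 20:", round(history[-1], 4))
```

No single rule separates these toy rows, so the training error keeps shrinking as rules are added. A real project would rely on a tested library and judge the model on held-out data, not training error.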
Why This Is a Good Topic
This topic works well because public datasets like NHANES let you study a real health question without needing a hospital lab. You can test a clear prediction task, compare models, and measure performance with metrics that matter in medicine. It connects to early disease screening, which is a real problem, and it teaches you data cleaning, feature selection, validation, and interpretation. A student can produce a serious project here with careful analysis and good study design.
Research Questions
- How does adding liver enzyme markers change the model's ability to detect NAFLD from routine bloodwork?
- What is the effect of using a boosted-tree model instead of logistic regression on discrimination and calibration?
- How much does performance drop when the model is evaluated on a held-out NHANES survey cycle instead of by internal cross-validation?
- To what extent do glucose, triglycerides, and BMI improve prediction beyond liver enzymes alone?
- Which bloodwork features contribute most to model predictions under an interpretable boosting approach?
- What is the effect of class imbalance handling on sensitivity, specificity, and precision for NAFLD detection?
- To what extent does decision-curve analysis show that the model offers net benefit over simple screening rules?
Basic Materials
- Laptop with enough memory to handle a public health dataset.
- Python installed with pandas, scikit-learn, xgboost or lightgbm, and statsmodels.
- NHANES public data files and codebook documentation.
- Spreadsheet software for quick checks and annotation.
- Internet access for downloading data dictionaries, review articles, and methods papers.
- Text editor or notebook environment such as Jupyter Notebook or VS Code.
Advanced Materials
- Laptop or workstation with Python, R, or both for parallel analysis.
- NHANES linked demographic, laboratory, and exam datasets.
- Jupyter Notebook or RStudio for reproducible analysis.
- shap or a similar package for feature attribution.
- Calibration and decision-curve analysis packages in Python or R.
- External validation dataset from a different survey wave or a separate public cohort, if available.
- Version control with Git for tracking model changes.
Software & Tools
- Python: Runs data cleaning, feature engineering, model training, and evaluation.
- Jupyter Notebook: Keeps code, charts, and notes in one reproducible place.
- pandas: Cleans NHANES tables and merges survey files.
- scikit-learn: Splits data, trains baselines, and scores model performance.
- SHAP: Explains which variables push predictions toward or away from NAFLD.
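As a rough sketch of how pandas can merge survey files, the example below joins three tiny invented tables on SEQN, the respondent sequence number that NHANES uses to link files. The other column names (RIDAGEYR, LBXSATSI, BMXBMI) are assumed here to stand for age, ALT, and BMI; check them against the codebook for the cycle you download. Real NHANES files are SAS transport (.XPT) files, which pandas can read with pd.read_sas.

```python
import pandas as pd

# Invented stand-ins for three NHANES tables (demographics, biochemistry, body measures).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 51, 46, 62]})
biochem = pd.DataFrame({"SEQN": [1, 2, 3], "LBXSATSI": [22.0, 48.0, None]})
body = pd.DataFrame({"SEQN": [1, 2, 3, 4], "BMXBMI": [23.1, 31.4, 28.9, 26.0]})

# Left joins keep every participant, so missing labs stay visible instead of
# silently dropping rows. validate= raises an error if SEQN is duplicated.
table = (
    demo.merge(biochem, on="SEQN", how="left", validate="one_to_one")
        .merge(body, on="SEQN", how="left", validate="one_to_one")
)

# Share of missing values per column, worth reporting before any modeling.
missing_share = table.isna().mean()
print(table)
print(missing_share)
```

Using a left join from the demographics table is one possible design choice. It makes it easy to see who lacks bloodwork before deciding whether to impute or exclude.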
Experiment Steps
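- Define the exact outcome you will predict, including how NAFLD is labeled in the data (for example, an imaging-based liver fat measure rather than a score built from the same blood markers you use as predictors), and the NHANES survey cycles you will use.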
- Choose a small set of routine bloodwork features that a real clinic could access easily.
- Build a clean analysis table and decide how you will handle missing values, outliers, and class imbalance.
- Train a simple baseline first, then compare it with your boosted-tree model.
- Plan a validation split by survey cycle, holding out at least one cycle entirely, and decide in advance which metrics will count most.
- Add interpretability and decision-curve analysis so your model answers a clinical question, not just a coding question.
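The steps above can be sketched as one short scikit-learn workflow. This is only an outline under stated assumptions: the data are synthetic random numbers standing in for four blood markers, the cycle labels are placeholders, and scikit-learn's GradientBoostingClassifier is swapped in for xgboost or lightgbm to keep the example dependency-light. Putting imputation and scaling inside a pipeline means they are fit on the training cycles only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 600 "participants", 4 "blood markers".
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 4))
logit = 1.2 * X[:, 0] + 0.8 * X[:, 1] * (X[:, 2] > 0) - 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in missing lab values

# Placeholder cycle labels; the last cycle is held out entirely.
cycle = np.repeat(["cycle_A", "cycle_B", "cycle_C"], n // 3)
train, test = cycle != "cycle_C", cycle == "cycle_C"

models = {
    "logistic baseline": make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
    "boosted trees": make_pipeline(
        SimpleImputer(strategy="median"),
        GradientBoostingClassifier(random_state=0),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X[train], y[train])            # preprocessing is fit on training cycles only
    prob = model.predict_proba(X[test])[:, 1]
    results[name] = {
        "auc": roc_auc_score(y[test], prob),        # discrimination
        "brier": brier_score_loss(y[test], prob),   # lower is better; reflects calibration too
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With real NHANES data you would also need to decide how to handle survey weights, which this sketch ignores.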
Common Pitfalls
- Using too many predictors, which can make the model look strong in training and weak on held-out data.
- Fitting imputation, scaling, or feature selection on all survey years at once, which can leak information across the validation split.
- Dropping every participant with a missing bloodwork value without checking who gets excluded, which can bias the sample.
- Reporting only accuracy, which can hide poor sensitivity for a screening problem.
- Skipping calibration and decision-curve analysis, which leaves you without a real-world view of usefulness.
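The accuracy pitfall above is easy to demonstrate with a small worked example. The sketch below uses invented counts: 100 people, 10 of whom have the condition. The helper name screening_metrics is hypothetical.

```python
def screening_metrics(y_true, y_pred):
    """Compute basic screening metrics from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "precision": tp / (tp + fp) if tp + fp else None,  # undefined if nothing is flagged
    }

# Invented example: 10 cases followed by 90 non-cases.
y_true = [1] * 10 + [0] * 90

# "Model" A never flags anyone.
always_negative = [0] * 100

# "Model" B catches 8 of 10 cases but also flags 18 healthy people.
screener = [1] * 8 + [0] * 2 + [1] * 18 + [0] * 72

print("always negative:", screening_metrics(y_true, always_negative))
print("screener:       ", screening_metrics(y_true, screener))
```

The always-negative rule scores 0.90 accuracy with 0.0 sensitivity, while the screener scores only 0.80 accuracy but finds 8 of the 10 cases. For a screening question, the second result is the more useful one.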
What Makes This Competitive
A stronger version of this project goes beyond raw accuracy. You would compare against a simple clinical baseline, validate on a separate survey cycle, and report calibration, not just discrimination. You could also show which blood markers add real predictive value and whether the model gives net benefit across useful thresholds. That kind of analysis looks much closer to a real screening study than a standard class project.
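Net benefit, the quantity plotted in decision-curve analysis, can be computed by hand. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability at which someone would be referred for follow-up. The ten labels and predicted probabilities are invented, and the function names are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of referring everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the simple rule 'refer everyone'."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Invented labels and predicted risks for ten people.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]

for pt in (0.2, 0.5):
    print(
        f"threshold {pt}:",
        "model", round(net_benefit(y_true, y_prob, pt), 3),
        "| refer all", round(net_benefit_treat_all(y_true, pt), 3),
        "| refer none 0.0",
    )
```

At a threshold of 0.2 the toy model's net benefit is 0.225 versus 0.125 for referring everyone. At 0.5 it is 0.1 versus -0.4. A decision curve repeats this calculation across a range of thresholds and plots the three lines together.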
Project Variations
- Use a different liver outcome, such as fibrosis risk, and compare whether routine bloodwork predicts it better or worse.
- Replace the boosted-tree model with a sparse logistic model and test whether simplicity costs much performance.
- Compare prediction using bloodwork alone versus bloodwork plus demographic features such as age, sex, and BMI.
Learn More
- NHANES: The National Health and Nutrition Examination Survey provides the public data files, codebooks, and laboratory measures you need.
- CDC NHANES Analytic Guidelines: Explain survey weighting, missing data, and how to analyze survey data correctly.
- NIH PubMed: Search for review articles on NAFLD screening, routine blood biomarkers, and machine learning in medicine.
- Nature Medicine and Hepatology: Search recent articles on NAFLD prediction, validation, and biomarker studies.
- scikit-learn Documentation: Covers model building, cross-validation, calibration, and evaluation methods.
- MIT OpenCourseWare: Search for free courses on machine learning, statistics, or data analysis if you want background on the methods.
Biomedical and Health Sciences Category Guide
How to Do Real Biomedical and Health Sciences Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Hub →
