Auditing Bias in Clinical Prediction Models

ISEF Category: Biomedical and Health Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: Home Setup  ·  Time: 1 to 2 Months

The Hook

A model can look accurate overall and still hurt one group more than another. That is the trap in clinical prediction tools, especially when a race or sex variable changes the score. You can test that problem with public health data and show where the numbers break. Then you can try to fix the calibration.

What Is It?

Clinical prediction models turn patient data into a risk score or treatment signal. Doctors use them to estimate things like kidney function, liver disease severity, or heart disease risk. A model works like a shortcut rule, and a shortcut only works well if it fits the people being measured.

Bias shows up when the model performs better for one group than another. One group may get scores that are systematically too high or too low (poor calibration), or scores that rank high-risk and low-risk patients less reliably (poor discrimination). Reweighting means giving each record a weight so the analysis sample better matches the population you want to describe. Recalibration means you update the score mapping so predicted risk lines up more closely with actual outcomes.
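To make "too high" or "too low" measurable, here is a minimal sketch of two calibration metrics computed per subgroup. It assumes a binary outcome and a predicted risk between 0 and 1. The data are synthetic stand-ins, not NHANES, and the names `calibration_gap`, `binned_calibration_error`, and the `weights` array are illustrative choices rather than part of any library.

```python
import numpy as np

def calibration_gap(y_true, y_prob, weights=None):
    """Weighted mean predicted risk minus weighted observed event rate."""
    w = np.ones(len(y_prob)) if weights is None else np.asarray(weights, dtype=float)
    return np.average(y_prob, weights=w) - np.average(y_true, weights=w)

def binned_calibration_error(y_true, y_prob, weights=None, n_bins=10):
    """Weight-averaged |observed rate - mean predicted risk| over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    w = np.ones(len(y_prob)) if weights is None else np.asarray(weights, dtype=float)
    bin_index = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if w[in_bin].sum() == 0:
            continue  # skip empty bins
        observed = np.average(y_true[in_bin], weights=w[in_bin])
        predicted = np.average(y_prob[in_bin], weights=w[in_bin])
        total += abs(observed - predicted) * w[in_bin].sum() / w.sum()
    return total

# Synthetic stand-in data (NOT NHANES): group 1 scores are inflated by 0.10.
rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, n)
true_risk = rng.uniform(0.05, 0.60, n)
outcome = (rng.random(n) < true_risk).astype(float)
score = np.where(group == 1, true_risk + 0.10, true_risk)
weights = rng.uniform(0.5, 2.0, n)  # hypothetical survey weights

gaps = {}
for g in (0, 1):
    m = group == g
    gaps[g] = calibration_gap(outcome[m], score[m], weights[m])
    ece = binned_calibration_error(outcome[m], score[m], weights[m])
    print(f"group {g}: gap={gaps[g]:+.3f}  binned error={ece:.3f}")
```

A positive gap means the score overestimates risk for that group on average. The binned error also catches cases where over- and underestimation cancel out in the overall mean.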

Why This Is a Good Topic

This is a strong science fair topic because you can test a real medical fairness question with public data, clear metrics, and repeatable analysis. You can compare calibration, error rates, and subgroup performance without needing a hospital lab. The topic connects directly to patient care, and you can learn how data design, model choice, and fairness trade off against one another.

Research Questions

  • How does reweighting NHANES samples change calibration error across racial and gender groups?
  • What is the effect of removing race coefficients on prediction error for each subgroup?
  • Does recalibration aimed at reducing subgroup bias improve subgroup calibration without hurting overall discrimination?
  • To what extent do different fairness metrics agree or disagree for the same clinical model?
  • Which model correction, reweighting or recalibration, reduces subgroup bias more consistently?
  • How does subgroup sample size affect the stability of fairness estimates?
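As one concrete way to approach the race-coefficient question, the sketch below encodes the CKD-EPI 2009 creatinine equation for eGFR with a switch that drops the race multiplier. The coefficients are transcribed from the commonly cited form of that equation and should be checked against the original publication before use. The function name, the `use_race` flag, and the example patient are illustrative assumptions. Dropping the multiplier is also not the same as using a refitted race-free equation.

```python
def ckd_epi_2009(scr_mg_dl, age, female, black, use_race=True):
    """eGFR in mL/min/1.73 m^2 from serum creatinine (mg/dL).

    Coefficients as commonly cited for CKD-EPI 2009; verify against the paper.
    """
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age
    if female:
        egfr *= 1.018
    if black and use_race:
        egfr *= 1.159  # the race coefficient under audit
    return egfr

# Hypothetical patient, for illustration only.
with_race = ckd_epi_2009(1.4, 60, female=False, black=True, use_race=True)
without_race = ckd_epi_2009(1.4, 60, female=False, black=True, use_race=False)
print(f"with race coefficient:    {with_race:.1f}")
print(f"without race coefficient: {without_race:.1f}")

# 60 is used here as an illustrative cutoff for reduced kidney function.
print("classification changes:", (with_race >= 60) != (without_race >= 60))
```

Running the same comparison over every eligible NHANES record, then counting how many people cross a clinical cutoff in each subgroup, turns the research question into a table you can report.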

Basic Materials

  • Laptop with Python support and at least 8 GB RAM.
  • Internet access for downloading public datasets and code.
  • NHANES public-use data files from CDC.
  • Published model equations or scoring rules for the clinical predictor you choose.
  • Spreadsheet software for tracking variables and results.
  • Digital notebook for documenting inclusion rules, subgroup definitions, and outputs.

Advanced Materials

  • Laptop or workstation with 16 GB RAM or more.
  • Python environment with data-analysis libraries such as pandas, numpy, scipy, and statsmodels.
  • NHANES survey design files and weight variables.
  • PubMed abstracts or full papers for the original model and validation studies.
  • Package for fairness and calibration analysis, such as scikit-learn plus custom metrics.
  • Version control repository for code and result tracking.

Software & Tools

  • Python: Handles data cleaning, subgroup analysis, and recalibration.
  • Pandas: Organizes NHANES tables and merges survey variables.
  • scikit-learn: Computes prediction metrics, calibration curves, and model comparisons.
  • Statsmodels: Supports weighted regression, which is useful when applying survey weights.
  • Jupyter Notebook: Keeps code, notes, and plots in one shareable file.

Experiment Steps

  1. Choose one clinical model and define the exact outcome you will audit.
  2. Build a clean NHANES analysis set with the variables the model needs.
  3. Split the data into subgroup slices so you can compare performance by race and gender.
  4. Decide which fairness and calibration metrics will answer your question best.
  5. Test a baseline score, then compare it with reweighted and recalibrated versions.
  6. Summarize whether the fix improves equity, accuracy, or both.
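The steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not a full NHANES pipeline: the data are simulated, the weights are made up, and the correction shown is group-wise logistic recalibration (an intercept and slope on the logit of the baseline score), which is only one of several possible choices. All function names are hypothetical.

```python
import numpy as np

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_recalibration(y, p, w, n_iter=25):
    """Weighted logistic regression of y on logit(p), fitted by Newton's method."""
    X = np.column_stack([np.ones(len(p)), logit(p)])
    beta = np.array([0.0, 1.0])  # start at "no change"
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)
        grad = X.T @ (w * (y - mu))
        hess = (X * (w * mu * (1 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def apply_recalibration(p, beta):
    return sigmoid(beta[0] + beta[1] * logit(p))

def weighted_gap(y, p, w):
    """Weighted mean predicted risk minus weighted observed rate."""
    return np.average(p, weights=w) - np.average(y, weights=w)

# Steps 1-2: synthetic analysis set standing in for NHANES.
rng = np.random.default_rng(1)
n = 6000
group = rng.integers(0, 2, n)
true_risk = rng.uniform(0.05, 0.60, n)
y = (rng.random(n) < true_risk).astype(float)
w = rng.uniform(0.5, 2.0, n)  # hypothetical survey weights
# Baseline score overestimates risk for group 1 only.
baseline = np.where(group == 1, sigmoid(logit(true_risk) + 0.6), true_risk)

# Hold out half the records so the fix is not judged on its own training data.
order = rng.permutation(n)
is_train = np.zeros(n, dtype=bool)
is_train[order[: n // 2]] = True

# Steps 3-5: per-subgroup baseline versus recalibrated score on held-out data.
before, after = {}, {}
for g in (0, 1):
    tr = is_train & (group == g)
    te = ~is_train & (group == g)
    beta = fit_recalibration(y[tr], baseline[tr], w[tr])
    recal = apply_recalibration(baseline[te], beta)
    before[g] = weighted_gap(y[te], baseline[te], w[te])
    after[g] = weighted_gap(y[te], recal, w[te])
    # Step 6: summarize the change.
    print(f"group {g}: gap before={before[g]:+.3f}  after={after[g]:+.3f}")
```

Fitting the correction on one half and evaluating on the other keeps the reported improvement honest. With real NHANES data, the simulated arrays would be replaced by the cleaned analysis set and its published weight variable.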

Common Pitfalls

  • Using a model equation that does not match the data fields you actually have, which makes the audit invalid.
  • Comparing groups with tiny sample sizes, which makes the fairness results jump around.
  • Mixing survey weights and raw counts, which can distort the subgroup estimates.
  • Judging bias with only one metric, which can hide a tradeoff in calibration or discrimination.
  • Recalibrating on the same data you use for evaluation, which makes the improvement look better than it really is.
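The small-sample pitfall can be made visible with a bootstrap. The sketch below resamples records with replacement and reports a percentile interval for the calibration gap in a small and a large subgroup. It uses synthetic, perfectly calibrated data and a simple unweighted bootstrap, which is an assumption made for brevity and ignores the NHANES complex survey design. The function name is hypothetical.

```python
import numpy as np

def bootstrap_gap_interval(y, p, n_boot=1000, seed=0):
    """95% percentile interval for mean(predicted) - mean(observed)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = len(y)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample records with replacement
        gaps[b] = p[idx].mean() - y[idx].mean()
    low, high = np.percentile(gaps, [2.5, 97.5])
    return low, high

rng = np.random.default_rng(2)
widths = {}
for label, size in (("small subgroup", 50), ("large subgroup", 2000)):
    risk = rng.uniform(0.05, 0.60, size)        # synthetic, well calibrated
    y = (rng.random(size) < risk).astype(float)
    low, high = bootstrap_gap_interval(y, risk)
    widths[label] = high - low
    print(f"{label} (n={size}): gap interval [{low:+.3f}, {high:+.3f}]")
```

If the interval for a subgroup is wide enough to include both "no bias" and "large bias", the honest conclusion is that the data cannot tell them apart.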

What Makes This Competitive

A strong version of this project goes past a simple bias check. You compare several correction methods, test them on more than one subgroup, and report both calibration and discrimination. You also show whether the fix helps one group while hurting another. That kind of careful analysis makes the project much stronger than a basic fairness demo.
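Reporting both calibration and discrimination matters because a score can rank patients well while still being systematically off. The sketch below illustrates this on synthetic data: inflating every score in a group by 50% leaves the weighted AUC unchanged but opens a large calibration gap. The `weighted_auc` helper is a hypothetical pairwise implementation written for clarity, and the weights are made up.

```python
import numpy as np

def weighted_auc(y, score, w):
    """Weighted probability that a random case outscores a random non-case."""
    y = np.asarray(y).astype(bool)
    score = np.asarray(score, dtype=float)
    w = np.asarray(w, dtype=float)
    pos, neg = score[y], score[~y]
    wp, wn = w[y], w[~y]
    # Each case/non-case pair counts 1 for a win and 0.5 for a tie.
    wins = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    pair_w = wp[:, None] * wn[None, :]
    return float((wins * pair_w).sum() / pair_w.sum())

def weighted_gap(y, p, w):
    """Weighted mean predicted risk minus weighted observed rate."""
    return float(np.average(p, weights=w) - np.average(y, weights=w))

# Synthetic stand-in data (NOT NHANES).
rng = np.random.default_rng(3)
n = 2000
group = rng.integers(0, 2, n)
true_risk = rng.uniform(0.05, 0.60, n)
y = (rng.random(n) < true_risk).astype(float)
w = rng.uniform(0.5, 2.0, n)  # hypothetical survey weights
inflated = np.where(group == 1, 1.5 * true_risk, true_risk)

m = group == 1
auc_true = weighted_auc(y[m], true_risk[m], w[m])
auc_inflated = weighted_auc(y[m], inflated[m], w[m])
gap_true = weighted_gap(y[m], true_risk[m], w[m])
gap_inflated = weighted_gap(y[m], inflated[m], w[m])
print(f"group 1 AUC: original={auc_true:.3f}  inflated={auc_inflated:.3f}")
print(f"group 1 gap: original={gap_true:+.3f}  inflated={gap_inflated:+.3f}")
```

Because AUC depends only on how patients are ranked, it cannot detect this kind of error, which is why a single metric is not enough for a fairness audit.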

Project Variations

  • Audit one model at a time, such as an eGFR equation, the ASCVD risk score, or MELD, and compare how bias patterns differ.
  • Swap the race-based subgroup comparison for a sex-based or age-based one to test whether the same fairness pattern appears elsewhere.
  • Compare reweighting with post-hoc recalibration and report which method keeps overall accuracy while improving subgroup fit.

Learn More

  • NHANES, CDC National Center for Health Statistics: Search the NHANES data and documentation pages for public survey files, weights, and codebooks.
  • PubMed: Search for review articles on clinical prediction bias, model calibration, and subgroup fairness.
  • NIH Office of Data Science Strategy: Find background on responsible data use and biomedical machine learning.
  • scikit-learn documentation: Read the model evaluation and calibration sections for free examples and metric definitions.
  • Statsmodels documentation: Find weighted regression tools that can be applied to survey-weighted public health data.
  • NEJM and JAMA articles: Search for editorials and validation studies on race corrections in clinical scoring systems.
