Synthetic EHR Cohorts with TabDDPM

ISEF Category: Biomedical and Health Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

In a small EHR cohort, a few unusual values can be enough to single out one patient, so sharing the raw table is risky. Synthetic records aim to keep the statistical patterns without keeping the people. Tabular diffusion models are one way to generate them. Your project can test whether the synthetic data still trains useful models and still protects privacy.

What Is It?

This project studies how TabDDPM, a diffusion model designed for tabular data, can generate synthetic electronic health record (EHR) data for rare diseases. Synthetic data is artificial data built from real patterns. In this case, the model learns from de-identified patient tables in MIMIC, then generates new rows that should look like real patients without copying any one person exactly.
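
For intuition, here is a minimal sketch of the forward (noising) half of a Gaussian diffusion process on numeric columns. The TabDDPM paper pairs Gaussian diffusion for numerical features with multinomial diffusion for categorical ones; this sketch covers only the numeric part. The linear beta schedule, the step count, and the function name `forward_noise` are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) for standardized numeric features."""
    # Fraction of the original signal variance that survives through step t.
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
x0 = rng.standard_normal((5, 3))        # toy stand-in for scaled lab values

x_early = forward_noise(x0, 10, betas, rng)    # still close to the original rows
x_late = forward_noise(x0, 999, betas, rng)    # almost pure noise
```

A trained model learns to reverse this process step by step, which is how it turns random noise into new rows.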

Think of it like learning the grammar of a spreadsheet. Ideally, the model does not memorize every sentence. Instead, it learns which lab values, diagnoses, medications, and outcomes tend to show up together. Downstream-task fidelity means you check whether a model trained on synthetic data still works well on real data, a setup often called train on synthetic, test on real (TSTR). Re-identification risk means you test whether someone could match a synthetic row back to a real patient.
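
The downstream-task check can be sketched as follows. This is a toy illustration, not a full pipeline: the "real" cohort comes from scikit-learn's `make_classification`, the "synthetic" table is just the real training rows plus noise (a placeholder for generator output), and the helper name `auc_on_real` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for a real cohort with a rare positive class (about 10%).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Placeholder "synthetic" table (assumption): real training rows plus noise.
rng = np.random.default_rng(0)
X_syn = X_train + rng.normal(scale=0.5, size=X_train.shape)
y_syn = y_train

def auc_on_real(X_fit, y_fit):
    """Fit on the given table, then score on the held-out REAL test rows."""
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_trtr = auc_on_real(X_train, y_train)   # train on real, test on real
auc_tstr = auc_on_real(X_syn, y_syn)       # train on synthetic, test on real
print(f"TRTR AUC={auc_trtr:.3f}  TSTR AUC={auc_tstr:.3f}")
```

The gap between the two scores is the quantity of interest: a small gap suggests the synthetic table kept the signal the classifier needs.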

Why This Is a Good Topic

This is a strong science fair topic because you can measure both usefulness and privacy. That gives you a real tradeoff to study, not just a yes-or-no result. It also connects to a real problem in health research, since rare-disease cohorts are small, sensitive, and hard to share. You can learn data cleaning, feature engineering, model evaluation, and privacy testing in one project.

Research Questions

  • How does cohort size affect downstream-task fidelity of TabDDPM-generated rare-disease EHR data?
  • What is the effect of class imbalance on the privacy-utility tradeoff in synthetic EHR cohorts?
  • Does conditioning TabDDPM on disease label improve rare-disease feature coverage compared with unconditional generation?
  • To what extent does synthetic data preserve correlations among lab values, diagnoses, and medications?
  • Which privacy metric, nearest-neighbor distance or membership inference AUC, detects leakage of training records more reliably?
  • How does removing quasi-identifiers, along with any direct identifiers that remain, before training change fidelity and privacy scores?
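
The two privacy metrics named in these questions can be sketched on toy data. Everything here is illustrative: the arrays are random numbers, the "generator" deliberately leaks by lightly perturbing training rows, and the distance-based attack is one simple form of membership inference, not the only one. Real features would need scaling and encoding first, because Euclidean distance is sensitive to units.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def distance_to_closest(queries, reference):
    """Euclidean distance from each query row to its nearest reference row."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    dist, _ = nn.kneighbors(queries)
    return dist[:, 0]

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 5))     # rows the generator saw (toy)
holdout = rng.standard_normal((200, 5))   # rows it never saw (toy)
# Placeholder leaky generator (assumption): perturbed copies of training rows.
synthetic = train + rng.normal(scale=0.05, size=train.shape)

# Metric 1: distance to closest record, from each synthetic row to the training set.
dcr = distance_to_closest(synthetic, train)

# Metric 2: distance-based membership inference. A real row that sits very
# close to some synthetic row is guessed to be a training member.
d_train = distance_to_closest(train, synthetic)
d_holdout = distance_to_closest(holdout, synthetic)
is_member = np.r_[np.ones(len(d_train)), np.zeros(len(d_holdout))]
scores = -np.r_[d_train, d_holdout]       # smaller distance = higher score
mia_auc = roc_auc_score(is_member, scores)
print(f"median DCR={np.median(dcr):.3f}  MIA AUC={mia_auc:.3f}")
```

An attack AUC near 0.5 means the attacker cannot tell members from non-members; an AUC near 1.0, as this leaky placeholder should produce, signals memorization.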

Basic Materials

  • Laptop or desktop computer with at least 16 GB RAM.
  • Reliable internet connection and a PhysioNet account.
  • Python 3.11 environment with Jupyter Notebook.
  • Git and a GitHub account for code tracking.
  • De-identified EHR cohort exported as CSV or parquet files.
  • Secure storage for patient-level data files.

Advanced Materials

  • GPU workstation or cloud GPU instance with 8 GB to 16 GB of VRAM.
  • PyTorch-compatible CUDA setup.
  • MIMIC-IV (access requires PhysioNet credentialing, completed human-subjects training, and a signed data use agreement) or a similar de-identified EHR dataset.
  • Secure research workspace with access logging.
  • Statistical package for privacy and utility metrics.
  • Scripted feature-engineering pipeline for cohort building and evaluation.

Software & Tools

  • Python: Runs data cleaning, model training, and evaluation scripts.
  • Jupyter Notebook: Lets you inspect cohorts, plots, and metric tables step by step.
  • PyTorch: Trains the tabular diffusion model and related baselines.
  • pandas: Cleans, joins, and reshapes EHR tables into model-ready features.
  • scikit-learn: Fits downstream classifiers and computes validation metrics.

Experiment Steps

  1. Define the rare-disease cohort and the exact prediction task you want to test.
  2. Decide which patient features to keep, drop, and encode before training.
  3. Build one simple baseline generator so you can compare TabDDPM against something easier.
  4. Plan a scorecard that measures fidelity, privacy, and subgroup balance side by side.
  5. Split real data carefully so no patient appears in both training and evaluation sets.
  6. Add one ablation that changes only a single modeling choice at a time.

Common Pitfalls

  • Mixing train and test patients, which makes downstream scores look better than they really are.
  • Keeping ultra-rare categorical codes with too few examples, which can push the model toward memorization.
  • Judging privacy only by reconstruction error, which can miss membership inference and nearest-neighbor leakage.
  • Comparing TabDDPM to the wrong baseline, which hides whether the diffusion model adds value.
  • Encoding dates or patient IDs too directly, which leaks identity and distorts the feature space.
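
Two of these pitfalls, ultra-rare codes and raw dates or IDs, can be handled in preprocessing. The sketch below uses pandas on a made-up table; the column names, the `OTHER` bucket label, the `min_count` threshold, and the helper `collapse_rare` are all assumptions to adapt to your cohort.

```python
import pandas as pd

def collapse_rare(series, min_count):
    """Replace categories seen fewer than min_count times with 'OTHER'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "OTHER")

df = pd.DataFrame({
    "subject_id": [10, 10, 11, 12, 12],
    "admit_date": pd.to_datetime(
        ["2150-01-01", "2150-03-02", "2151-06-10", "2149-12-30", "2150-01-04"]
    ),
    "dx_code": ["E75.2", "I10", "I10", "I10", "Q87.1"],
})

# Pitfall: ultra-rare codes. Pool them so no single patient owns a category.
df["dx_code"] = collapse_rare(df["dx_code"], min_count=2)

# Pitfall: raw dates. Keep only time elapsed since each patient's first admission.
first_admit = df.groupby("subject_id")["admit_date"].transform("min")
df["days_since_first_admit"] = (df["admit_date"] - first_admit).dt.days

# Pitfall: patient IDs. Drop them (and the raw date) before training the generator.
features = df.drop(columns=["subject_id", "admit_date"])
print(features)
```

Keep the ID column in a separate, secured table if you still need it for the patient-level split.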

What Makes This Competitive

A strong version goes beyond asking whether the synthetic table looks real. You can compare multiple privacy tests, stress the model on the smallest disease subgroups, and report where fidelity breaks first. You can also separate prediction utility from subgroup analysis, since those can tell different stories. That kind of careful evaluation makes the project read like research, not a demo.

Project Variations

  • Swap the rare-disease cohort for another low-prevalence condition, then compare whether TabDDPM preserves the same feature patterns.
  • Compare TabDDPM with a variational autoencoder, CTGAN, or a simple bootstrap baseline to see which generator balances fidelity and privacy best.
  • Focus on one privacy test, such as membership inference, and study how feature selection changes exposure risk.
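
As a starting point for the baseline comparison, here is a minimal sketch of two simple generators. Both function names and the toy table are hypothetical. The bootstrap baseline copies real rows, so it marks the high-fidelity, worst-privacy corner; the independent-marginals baseline keeps each column's distribution but breaks the relationships between columns.

```python
import numpy as np
import pandas as pd

def bootstrap_baseline(real, n, seed=0):
    """Resample whole rows with replacement. Every output row is a real record."""
    return real.sample(n=n, replace=True, random_state=seed).reset_index(drop=True)

def independent_marginals_baseline(real, n, seed=0):
    """Sample each column independently: keeps marginals, destroys correlations."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.choice(real[col].to_numpy(), size=n, replace=True)
        for col in real.columns
    })

# Made-up cohort table for illustration only.
real = pd.DataFrame({
    "age":        [34, 51, 67, 45, 72, 29],
    "creatinine": [0.8, 1.1, 2.4, 0.9, 3.0, 0.7],
    "rare_dx":    [0, 0, 1, 0, 1, 0],
})

boot = bootstrap_baseline(real, 100)
indep = independent_marginals_baseline(real, 100)
```

A diffusion model earns its place only if it beats the independent-marginals baseline on fidelity while leaking less than the bootstrap baseline.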

Learn More

  • PhysioNet MIMIC-IV: De-identified hospital and ICU EHR data and documentation, found on PhysioNet.
  • MIMIC Code Repository: Cohort-building examples and SQL patterns, found on GitHub and PhysioNet.
  • PubMed: Search for review articles on synthetic health data, privacy, and EHR generation.
  • NIH Office of Data Science Strategy: Background on responsible data sharing and privacy, found on NIH sites.
  • arXiv: Search for TabDDPM and tabular diffusion model papers.
  • scikit-learn User Guide: Free reference for model evaluation and metrics, found in the official scikit-learn docs.