Synthetic EHRs for Rare Disease Research
ISEF Category: Translational Medical Science
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
Rare disease data can be locked away even when the science is public. Synthetic EHRs try to solve that by making fake records that still behave like real ones. Think of them as training wheels for clinical research. Your job is to see whether the fake data keeps the patterns while hiding the people behind them.
What Is It?
Electronic health records, or EHRs, are digital charts from clinics and hospitals. They can include diagnoses, lab values, medications, and visit history. For rare diseases, those records can be hard to access because privacy rules are strict and the patient count is small. That makes the data valuable but hard to study.
Synthetic-data models try to make new records that follow the same patterns as the original data without copying any one patient. CTGAN and TabDDPM are two model families that can do this for table data, which means data arranged in rows and columns. CTGAN stands for conditional tabular generative adversarial network. TabDDPM is a diffusion-based model for tabular data. You do not need to build the math from scratch. You need to compare how well each model mimics the real data and how well it protects privacy.
This project sits at the intersection of AI and medicine. You are asking a practical question, not just a coding question. Can synthetic records help students and mentors explore rare-disease trends without opening real patient charts?
Why This Is a Good Topic
This is a strong science fair topic because you can measure both usefulness and privacy. You can test whether synthetic data keeps age, diagnosis, lab, or outcome patterns close to the source data, then check whether the model accidentally copies records too closely. That makes the project real, measurable, and tied to a serious problem in medicine. You can also learn data cleaning, model comparison, validation, and statistical thinking.
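One way to put a number on "keeps diagnosis patterns close to the source data" is the total variation distance between the category shares of a real column and a synthetic column. The sketch below is a minimal illustration using only the Python standard library. The function names and the toy diagnosis labels are hypothetical and are not taken from any dataset or library.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the column as a dict."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute gap between two category distributions.

    0.0 means identical shares; 1.0 means the columns share no categories.
    """
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories
    )


# Hypothetical toy columns, not real patient data.
real_diagnosis = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]
synthetic_diagnosis = ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"]

tvd = total_variation_distance(real_diagnosis, synthetic_diagnosis)
print(f"Total variation distance: {tvd:.2f}")
```

A lower value means the synthetic column tracks the real one more closely. Numeric columns such as age or lab values would need a different measure, or binning first.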
Research Questions
- How does CTGAN compare with TabDDPM in preserving age, sex, diagnosis, and outcome distributions in rare-disease EHR data?
- What is the effect of sample size on the utility of synthetic rare-disease records?
- Does adding more feature engineering improve the match between synthetic and real EHR correlations?
- To what extent do synthetic records preserve subgroup patterns for patients with different disease severities?
- Which privacy metric best detects memorization in synthetic EHR data from rare-disease registries?
- How does the choice of evaluation metric change which synthetic model looks best?
- How does the choice of training registry affect downstream prediction performance?
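For the sample-size question above, one possible setup is to draw nested training subsets, so that every smaller sample sits inside the larger ones and size is the only thing that changes. This is a minimal sketch using the standard library. The function name, the seed, and the sizes are assumptions chosen for illustration.

```python
import random


def nested_subsamples(row_ids, sizes, seed=42):
    """Shuffle once, then take prefixes so smaller samples nest inside larger ones."""
    rng = random.Random(seed)  # fixed seed keeps the subsets reproducible
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    return {size: shuffled[:size] for size in sorted(sizes) if size <= len(shuffled)}


# Hypothetical row indices standing in for records in a training table.
row_ids = list(range(1000))
subsamples = nested_subsamples(row_ids, sizes=[100, 250, 500, 1000])

for size, ids in subsamples.items():
    print(size, "rows, first five ids:", ids[:5])
```

Each subset would then be used to train both models, so any change in utility can be traced to sample size and not to a different draw of patients.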
Basic Materials
- Laptop or desktop computer with enough memory to run tabular machine learning models.
- Python installed through Anaconda or a similar free distribution.
- Public rare-disease registry data or de-identified EHR benchmark data.
- Spreadsheet software or a CSV editor for checking columns and missing values.
- External storage or cloud folder for saving model outputs and logs.
- Notebook for tracking preprocessing choices, metrics, and model settings.
- Digital calendar or planner for managing experiments and reruns.
Advanced Materials
- University workstation or GPU server for training larger generative models.
- Python environment with PyTorch, pandas, NumPy, scikit-learn, and model code for CTGAN and TabDDPM.
- Secure access to de-identified clinical tabular data approved by your mentor or institution.
- Statistical software for deeper validation, such as R or Python stats packages.
- Visualization tools for comparing feature distributions, correlations, and privacy metrics.
- Version control system such as Git for tracking code changes.
- Secure data access setup that follows your mentor's data governance rules.
Software & Tools
- Python: Runs your data cleaning, model training, and evaluation scripts.
- Jupyter Notebook: Helps you document tests, plots, and results in one place.
- pandas: Loads tabular data, checks missing values, and prepares features.
- scikit-learn: Computes metrics, splits data, and supports downstream prediction tests.
- SDV: Provides synthetic tabular data tools, including CTGAN workflows.
Experiment Steps
- Define the rare-disease dataset, the target fields, and the privacy questions you want to answer.
- Choose the feature set you will preserve, then decide which columns need encoding, scaling, or grouping.
- Set up a fair comparison between CTGAN and TabDDPM with the same train, validation, and test splits.
- Build evaluation rules for utility, including distribution match, correlation structure, and downstream prediction performance.
- Plan privacy checks that look for record memorization, nearest-neighbor overlap, or membership risk.
- Compare results across models and sample sizes, then decide which trade-off best fits your research goal.
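The "correlation structure" check in the steps above can be sketched as follows: compute the Pearson correlation for every pair of numeric columns in the real table and in the synthetic table, then average the absolute differences. This is one possible scoring rule, not a standard required by CTGAN or TabDDPM. The column names and values are hypothetical toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_gap(real, synthetic):
    """Mean absolute difference in pairwise correlation over shared columns."""
    columns = sorted(set(real) & set(synthetic))
    if len(columns) < 2:
        raise ValueError("need at least two shared numeric columns")
    gaps = [
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(columns, 2)
    ]
    return sum(gaps) / len(gaps)


# Hypothetical toy tables stored as column -> list of values.
real = {"age": [10, 20, 30, 40, 50], "lab": [1.0, 2.1, 2.9, 4.2, 5.0], "visits": [5, 3, 4, 2, 1]}
faithful = {"age": [12, 22, 28, 41, 49], "lab": [1.2, 2.0, 3.1, 4.0, 5.1], "visits": [5, 4, 3, 2, 1]}
scrambled = {"age": [12, 22, 28, 41, 49], "lab": [5.1, 1.2, 4.0, 2.0, 3.1], "visits": [5, 4, 3, 2, 1]}

print("faithful gap: ", round(correlation_gap(real, faithful), 3))
print("scrambled gap:", round(correlation_gap(real, scrambled), 3))
```

The scrambled table has the same per-column values as the faithful one, so a column-by-column distribution check would not separate them. Only the correlation gap shows that the relationship between lab value and the other fields was lost.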
Common Pitfalls
- Using a tiny rare-disease sample, which can push synthetic models to memorize records instead of learning patterns.
- Comparing models on raw accuracy alone, which can hide bad privacy leakage or weak distribution matching.
- Mixing categorical codes with true numeric values without careful preprocessing, so the model treats codes like quantities and fails to learn structure.
- Ignoring class imbalance in rare-disease outcomes, which makes the synthetic data look better than it really is.
- Checking privacy only by visual similarity, which misses direct record copying and nearest-neighbor collisions.
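To go past visual similarity, the last pitfall suggests looking for direct copies and nearest-neighbor collisions. A minimal sketch of that idea: for each synthetic record, find the distance to its closest training record, then count exact and near matches. It assumes every feature is numeric and already scaled to the 0 to 1 range. The `threshold` value and the toy rows are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+
from statistics import median


def closest_distance(record, reference_rows):
    """Distance from one record to its nearest neighbor in reference_rows."""
    return min(dist(record, row) for row in reference_rows)


def privacy_summary(synthetic_rows, train_rows, threshold):
    """Summarize how close synthetic records sit to the training records."""
    distances = [closest_distance(row, train_rows) for row in synthetic_rows]
    return {
        "exact_copy_rate": sum(d == 0.0 for d in distances) / len(distances),
        "near_copy_rate": sum(d <= threshold for d in distances) / len(distances),
        "median_distance": median(distances),
    }


# Hypothetical scaled feature rows, not real patient data.
train_rows = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]
synthetic_rows = [(0.1, 0.2), (0.52, 0.49), (0.3, 0.9), (0.7, 0.1)]

summary = privacy_summary(synthetic_rows, train_rows, threshold=0.05)
print(summary)
```

One way to interpret the numbers is to run the same summary on real held-out records against the training set. If synthetic records sit much closer to the training data than held-out real records do, that points toward memorization.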
What Makes This Competitive
A strong project goes past basic model training. It compares more than one rare-disease dataset, uses several utility metrics, and tests privacy with methods that look for memorization, not just visual similarity. You could also study which variables synthetic models preserve well, and which ones they distort. A thoughtful error analysis and clear statistical comparison would make the work feel research-level.
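For the statistical comparison, one option is a paired percentile bootstrap over repeated training runs. The sketch below assumes each model was trained several times with different seeds and scored with the same metric each time. The scores are made-up placeholders, not results, and the function name is hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired run by run")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled[lower_index], resampled[upper_index]


# Placeholder scores (lower is better), one per training seed. Not real results.
ctgan_tvd = [0.12, 0.15, 0.11, 0.14, 0.13]
tabddpm_tvd = [0.09, 0.10, 0.10, 0.11, 0.08]

observed, low, high = bootstrap_mean_diff_ci(ctgan_tvd, tabddpm_tvd)
print(f"mean difference {observed:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the gap between the two models is less likely to be an artifact of which seeds happened to be used. With only a handful of runs the interval is rough, so more seeds give a more trustworthy comparison.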
Project Variations
- Use pediatric rare-disease registry data instead of adult records, then compare whether model quality changes by age group.
- Add a downstream risk model, then test whether synthetic data supports the same prediction task as the original data.
- Compare CTGAN and TabDDPM on mixed data types, then focus on which model better preserves rare categorical diagnoses.
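The downstream risk model variation is often framed as "train on synthetic, test on real": fit one model on real training data and another on synthetic data, then score both on the same held-out real records. The sketch below uses a tiny nearest-centroid classifier as a stand-in so it runs with only the standard library. In a real project a scikit-learn model would be a more natural choice. All rows and labels are hypothetical.

```python
from math import dist


def fit_centroids(rows, labels):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in sums[label]) for label in sums
    }


def predict(centroids, row):
    """Assign the class whose centroid is closest to the row."""
    return min(centroids, key=lambda label: dist(row, centroids[label]))


def accuracy(centroids, rows, labels):
    correct = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return correct / len(rows)


# Hypothetical scaled features with a binary outcome label.
real_train_X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8)]
real_train_y = [0, 0, 1, 1]
synthetic_train_X = [(0.2, 0.3), (0.3, 0.2), (0.7, 0.8), (0.8, 0.7)]
synthetic_train_y = [0, 0, 1, 1]
real_test_X = [(0.15, 0.25), (0.3, 0.3), (0.7, 0.6), (0.95, 0.9)]
real_test_y = [0, 0, 1, 1]

# Both models are scored on the same held-out real records.
real_model_acc = accuracy(fit_centroids(real_train_X, real_train_y), real_test_X, real_test_y)
synthetic_model_acc = accuracy(fit_centroids(synthetic_train_X, synthetic_train_y), real_test_X, real_test_y)
print("trained on real:     ", real_model_acc)
print("trained on synthetic:", synthetic_model_acc)
```

The gap between the two scores is the quantity of interest. A small gap suggests the synthetic data supports the prediction task about as well as the original data does.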
Learn More
- SDV Documentation: Search the SDV project docs for CTGAN and tabular synthetic data workflows.
- NIH Data Sharing and Privacy Resources: Search NIH guidance on de-identified data use and research privacy rules.
- PubMed: Search review articles on synthetic health data, rare disease registries, and privacy-preserving machine learning.
- Nature Scientific Data: Search for papers on synthetic medical data evaluation and benchmarking.
- MIT OpenCourseWare: Search for machine learning and probabilistic modeling courses that cover generative models and tabular data.
- CDC Rare Disease Information: Search CDC pages for background on rare disease surveillance and registry concepts.