Polygenic Risk Score Portability Across Ancestries

ISEF Category: Cellular and Molecular Biology

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genetics · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A DNA test can look precise and still miss the mark for many people. That happens when a model learns from one ancestry group and then gets used on others. You can measure that gap with public genetic data and real prediction scores.

What Is It?

A polygenic risk score, or PRS, adds up the small effects of many genetic variants to estimate a trait or disease risk. Think of it like a recipe with hundreds of tiny ingredients. Each ingredient matters a little, and the final score tries to predict the outcome.
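The "adding up" step can be written in a few lines. The sketch below assumes the usual additive form, where each person's score is the sum of their effect-allele counts (0, 1, or 2) multiplied by a per-variant effect size from a GWAS. The function name and the toy numbers are made up for illustration and do not come from any real dataset.

```python
import numpy as np

def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages: one score per individual.

    dosages: (n_individuals, n_variants) effect-allele counts (0, 1, 2).
    effect_sizes: (n_variants,) per-allele effect estimates from a GWAS.
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if dosages.shape[1] != effect_sizes.shape[0]:
        raise ValueError("one effect size is needed per variant column")
    return dosages @ effect_sizes

# Toy data: 3 people, 4 variants (made-up numbers for illustration only).
toy_dosages = np.array([[0, 1, 2, 0],
                        [1, 1, 0, 2],
                        [2, 0, 1, 1]])
toy_effects = np.array([0.10, -0.05, 0.20, 0.02])
toy_scores = polygenic_score(toy_dosages, toy_effects)
print(toy_scores)
```

In a real project the dosage matrix would come from a genotype file and the effect sizes from summary statistics, after matching variants and effect alleles between the two.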

The problem is that many PRS models come from discovery cohorts that are mostly European. If you build a model on one population, then test it on another, the score can lose accuracy. That drop is called poor portability. Your project asks how big that drop is, and whether a different modeling approach changes it.
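One way to put a number on that drop is to divide each group's variance explained (R²) by the R² in the group the model was built on. The sketch below assumes a quantitative trait and uses the squared Pearson correlation as R². The group names and R² values are hypothetical placeholders, not real results.

```python
import numpy as np

def r_squared(score, phenotype):
    """Squared Pearson correlation between a score and a quantitative trait."""
    r = np.corrcoef(score, phenotype)[0, 1]
    return r ** 2

def relative_accuracy(r2_by_group, reference_group):
    """Each group's R^2 divided by the reference group's R^2."""
    ref = r2_by_group[reference_group]
    return {group: r2 / ref for group, r2 in r2_by_group.items()}

# Hypothetical R^2 values, used only to show the arithmetic.
example_r2 = {"group_A": 0.10, "group_B": 0.05, "group_C": 0.025}
example_ratio = relative_accuracy(example_r2, "group_A")
for group, ratio in example_ratio.items():
    print(group, round(ratio, 2))
```

A ratio of 1 means the score works as well as in the reference group. A ratio of 0.5 means it explains half as much variance.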

A transformer-based PRS uses a machine learning model that can learn patterns among variants, not just add them one by one. That gives you a way to compare a newer method with standard PRS methods. You are not changing the biology. You are testing how well the prediction method travels across ancestry groups.
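To make "learning patterns among variants" more concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation inside a transformer, written in plain NumPy. It uses random, untrained weights, and the way dosages are turned into input vectors is a hypothetical choice made only for this example. It is not a working PRS model and not the design of any published method.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n_variants, d_model) one input vector per variant.
    Returns the mixed representations and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_variants, d_model = 5, 8
dosage = np.array([0.0, 1.0, 2.0, 1.0, 0.0])        # one person's genotype
variant_embed = rng.normal(size=(n_variants, d_model))
x = variant_embed * dosage[:, None]                  # hypothetical encoding
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
mixed, attn = self_attention(x, w_q, w_k, w_v)
print(attn.round(2))  # row i: how much variant i draws on every variant
```

In the additive score each variant contributes on its own. Here each variant's output depends on all the others through the attention weights, which is what lets a trained model pick up combinations of variants.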

Why This Is a Good Topic

This topic works well for a science fair because you can use public summary statistics and public genotype resources, so you do not need to collect human samples yourself. You can test a clear question, measure accuracy with standard metrics, and compare groups. The project connects to a real problem in health equity, since prediction tools can work better for some populations than others. You also learn data cleaning, model evaluation, and bias analysis, which are useful skills for genetics research.

Research Questions

How does PRS prediction accuracy change when the same model is tested across different ancestry groups?
What is the effect of using a European-only discovery cohort on PRS portability in non-European samples?
Does a transformer-based PRS reduce ancestry-related prediction loss compared with a standard additive PRS?
To what extent do p-value threshold choices change cross-ancestry PRS performance?
Which ancestry group shows the largest drop in explained variance when PRS is transferred from the training cohort?
How does linkage disequilibrium mismatch affect PRS portability across ancestries?
To what extent does using ancestry-matched weights improve calibration of the trait prediction score?
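For the p-value threshold question, the sketch below shows one way to build a separate score at each threshold by keeping only variants whose GWAS p-value falls at or below it. It assumes the variants have already been pruned for linkage disequilibrium (a real clumping-and-thresholding pipeline does that first, often with PLINK). The function name, thresholds, and toy numbers are illustrative assumptions.

```python
import numpy as np

def threshold_scores(dosages, effect_sizes, p_values, thresholds):
    """One additive score per p-value threshold.

    Variants with p_value <= threshold are kept. LD clumping is
    assumed to have been done beforehand and is not performed here.
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    scores = {}
    for t in thresholds:
        keep = p_values <= t
        scores[t] = dosages[:, keep] @ effect_sizes[keep]
    return scores

# Toy data: 2 people, 3 variants (made-up numbers).
toy_dosages = np.array([[1, 0, 2],
                        [0, 2, 1]])
toy_effects = np.array([0.5, -0.2, 0.1])
toy_p = np.array([1e-9, 1e-3, 0.4])
by_threshold = threshold_scores(toy_dosages, toy_effects, toy_p,
                                thresholds=[5e-8, 0.01, 1.0])
for t, s in by_threshold.items():
    print(t, s)
```

Each set of scores can then be evaluated in every ancestry group to see whether the best threshold is the same everywhere.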

Basic Materials

Laptop with enough storage and memory to handle genetic summary files.
Access to public GWAS summary statistics.
Access to a public genotype or cohort dataset with ancestry labels.
Spreadsheet software for tracking datasets and results.
Python installation with scientific libraries.
Text editor or notebook environment for code and notes.
Reference articles on polygenic risk scores and ancestry bias.
Headphones or a quiet workspace for long coding sessions.

Advanced Materials

Access to high-performance computing or a university workstation.
Large-scale genotype reference panel for ancestry-stratified analysis.
GWAS summary statistics for multiple traits.
Ancestry inference tools and quality-control scripts.
Python packages for machine learning and genomic analysis.
Version control system for code and experiment tracking.
Statistical analysis environment for calibration, ROC, and variance explained metrics.
Secure storage for human genomic data under your mentor's data policy.

Software & Tools

Python: Runs data cleaning, modeling, and statistical analysis for PRS evaluation.
pandas: Organizes summary statistics and prediction results into tables.
scikit-learn: Runs cross-validation and computes calibration and performance metrics.
PLINK: Handles genotype quality control, scoring, and basic genomic file processing.
matplotlib: Makes clear plots of accuracy, bias, and ancestry comparisons.

Experiment Steps

Define one quantitative trait and decide which ancestry groups you will compare.
Select public discovery and test datasets that include ancestry labels and enough sample size.
Choose a standard PRS pipeline and a second model, such as a transformer-based approach, for comparison.
Plan a shared evaluation framework so both models get scored on the same samples and metrics.
Build controls for ancestry matching, sample overlap, and population structure.
Decide how you will report bias, portability, and uncertainty with plots and confidence intervals.
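The shared evaluation step and the confidence intervals can be sketched together. The example below assumes a quantitative trait, uses squared Pearson correlation as R², and builds 95% percentile bootstrap intervals by resampling individuals within each ancestry group. All data are simulated and the model names are placeholders. Nothing here reflects real results.

```python
import numpy as np

def r_squared(score, phenotype):
    """Squared Pearson correlation between score and trait."""
    return np.corrcoef(score, phenotype)[0, 1] ** 2

def bootstrap_r2(score, phenotype, n_boot=1000, seed=0):
    """R^2 point estimate with a 95% percentile bootstrap interval."""
    rng = np.random.default_rng(seed)
    score = np.asarray(score, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    n = len(score)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample people with replacement
        estimates[b] = r_squared(score[idx], phenotype[idx])
    low, high = np.percentile(estimates, [2.5, 97.5])
    return r_squared(score, phenotype), low, high

def evaluate(models, phenotype, groups, n_boot=1000):
    """Score every model on the same samples, split by ancestry label."""
    phenotype = np.asarray(phenotype, dtype=float)
    groups = np.asarray(groups)
    results = {}
    for name, score in models.items():
        score = np.asarray(score, dtype=float)
        for g in np.unique(groups):
            mask = groups == g
            results[(name, str(g))] = bootstrap_r2(
                score[mask], phenotype[mask], n_boot=n_boot)
    return results

# Simulated example: scores are made noisier in group "B" on purpose.
rng = np.random.default_rng(1)
n = 200
labels = np.array(["A"] * n + ["B"] * n)
true_signal = rng.normal(size=2 * n)
trait = true_signal + rng.normal(size=2 * n)
noise_scale = np.where(labels == "A", 0.5, 2.0)
model_scores = {
    "additive": true_signal + noise_scale * rng.normal(size=2 * n),
    "second_model": true_signal + noise_scale * rng.normal(size=2 * n),
}
report = evaluate(model_scores, trait, labels, n_boot=200)
for key, (est, low, high) in report.items():
    print(key, round(est, 3), round(low, 3), round(high, 3))
```

Because both models pass through the same function with the same samples and labels, any difference in the report comes from the models rather than from the evaluation.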

Common Pitfalls

Mixing datasets that used different trait definitions, which makes the comparison unfair.
Ignoring ancestry balance in the test set, which can hide portability gaps.
Using summary statistics from overlapping cohorts, which can inflate performance.
Comparing raw prediction scores across ancestry groups without calibration, which can mislead because score distributions shift with allele frequencies.
Skipping quality control on public genotype data, which can create false signals from poorly imputed variants.
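For the calibration pitfall, one simple check is to fit a straight line of observed trait values on predicted values within each ancestry group. The sketch below assumes the predictions are already on the same scale as the trait. A slope near 1 and an intercept near 0 suggest good calibration in that group. The function name and data are made up for illustration.

```python
import numpy as np

def calibration_by_group(predicted, observed, groups):
    """Fit observed = intercept + slope * predicted within each group."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        # np.polyfit returns coefficients from highest degree to lowest.
        slope, intercept = np.polyfit(predicted[mask], observed[mask], 1)
        out[str(g)] = {"slope": slope, "intercept": intercept}
    return out

# Made-up example: predictions match the trait exactly in group "A"
# but overstate differences between people in group "B" (slope 0.5).
pred = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
obs = np.concatenate([pred[:4], 0.5 * pred[4:] + 1.0])
labels = np.array(["A"] * 4 + ["B"] * 4)
calibration = calibration_by_group(pred, obs, labels)
for group, fit in calibration.items():
    print(group, round(fit["slope"], 2), round(fit["intercept"], 2))
```

Reporting slope and intercept per group alongside R² shows whether a score is off in ranking, in scale, or in both.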

What Makes This Competitive

A strong version of this project goes past a simple accuracy comparison. You can test several ancestry groups, report calibration as well as variance explained, and show where the model fails, not just where it succeeds. You can also compare a standard PRS with a transformer-based method under the same evaluation plan. That kind of careful analysis shows real understanding of genetic prediction bias.

Project Variations

Compare PRS portability for height, LDL cholesterol, or another well-studied quantitative trait across ancestry groups.
Test whether ancestry-matched training subsets improve prediction more than a larger mixed-ancestry model.
Analyze how clumping and thresholding changes portability compared with a machine-learning PRS approach.

Learn More

NIH All of Us Research Program: Search the program site for background on ancestry diversity in genomic research and health equity.
PubMed: Search review articles on polygenic risk scores, ancestry bias, and portability.
NHGRI: Read the National Human Genome Research Institute pages on polygenic risk scores and genomic diversity.
UK Biobank: Use the resource description and published papers to understand large-cohort trait prediction studies.
Nature Reviews Genetics: Search the journal for review articles on PRS transferability and population structure.

