Privacy-Preserving DNA Kinship Search

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Your DNA can act like a family name. A tiny slice of genetic markers can point to relatives you never meant to reveal. That makes forensic DNA search powerful, but it also raises privacy questions that are easy to test with public data.

What Is It?

This project studies kinship search, which means finding people who may be related by comparing their DNA patterns. You can think of DNA markers like a long barcode. If two barcodes share enough matching segments, the people may be relatives.
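One simple way to turn "matching barcodes" into a number is identity-by-state (IBS) similarity, which counts how many alleles two people share at each marker. The sketch below is a minimal illustration, not a forensic-grade method. It assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele, with -1 marking missing calls. The function name and the toy vectors are made up for this example.

```python
import numpy as np


def ibs_similarity(g1, g2):
    """Mean identity-by-state similarity in [0, 1] for two genotype vectors.

    Assumed coding: 0/1/2 copies of the alternate allele, -1 for missing.
    """
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)
    # Only score markers that were called in both people.
    observed = (g1 >= 0) & (g2 >= 0)
    if not observed.any():
        return float("nan")
    # At each marker the pair shares 2, 1, or 0 alleles by state.
    shared = 2 - np.abs(g1[observed] - g2[observed])
    return float(shared.mean() / 2.0)


# Toy genotype vectors, invented for illustration only.
parent = np.array([0, 1, 2, 1, 0, 2])
child = np.array([0, 1, 1, 1, 1, 2])
stranger = np.array([2, 0, 0, 1, 2, 0])

print("parent vs child:   ", ibs_similarity(parent, child))
print("parent vs stranger:", ibs_similarity(parent, stranger))
```

A higher score means more shared alleles. On real data the gap between relatives and unrelated pairs is much smaller than in this toy case, which is why the number of markers matters.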

Locality-sensitive hashing, or LSH, is a fast way to find similar items without comparing every pair. It hashes items so that similar ones tend to land in the same bucket, and only items that share a bucket get compared directly. Imagine sorting thousands of socks into bins by color and pattern instead of comparing each sock to every other sock. In genomics, LSH can group DNA profiles that look alike. Your project asks how few single nucleotide polymorphisms, or SNPs, are enough to detect relatives in public genotype data.
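To make the bucket idea concrete, here is a minimal sketch of one LSH variant: random-hyperplane signatures split into bands. This is only one of several LSH families and is chosen here for brevity. The function name, the parameters `n_bits` and `band_size`, and the random genotype matrix are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def lsh_candidate_pairs(genotypes, n_bits=64, band_size=8, seed=0):
    """Return pairs of row indices that share at least one LSH bucket."""
    genotypes = np.asarray(genotypes, dtype=float)
    rng = np.random.default_rng(seed)
    # Center each SNP so the sign of a projection carries information.
    centered = genotypes - genotypes.mean(axis=0)
    # Each random hyperplane contributes one signature bit per person.
    planes = rng.standard_normal((genotypes.shape[1], n_bits))
    bits = (centered @ planes) > 0

    candidates = set()
    # Split the signature into bands; people whose bits agree on a whole
    # band fall into the same bucket and become a candidate pair.
    for start in range(0, n_bits, band_size):
        buckets = defaultdict(list)
        band = bits[:, start:start + band_size]
        for person, row in enumerate(band):
            buckets[row.tobytes()].append(person)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Random stand-in data: 200 people by 500 SNPs, not real genotypes.
rng = np.random.default_rng(1)
population = rng.integers(0, 3, size=(200, 500))
population[1] = population[0]  # plant one exact duplicate as a sanity check

pairs = lsh_candidate_pairs(population)
all_pairs = 200 * 199 // 2
print(f"{len(pairs)} candidate pairs out of {all_pairs} possible")
print("planted pair found:", (0, 1) in pairs)
```

Only the candidate pairs then need an exact similarity check. Longer bands give fewer false candidates but miss more true relatives, and more bands do the opposite, so these two settings are natural parameters to vary.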

Why This Is a Good Topic

This topic works well because you can test it with public data, clear metrics, and real privacy stakes. You can measure accuracy, speed, and the smallest SNP set that still finds kin. That gives you a clean engineering question and a policy angle. You can also learn data cleaning, similarity search, model evaluation, and basic ethics.

Research Questions

How does the number of SNPs affect kinship detection accuracy in public genotype data?
How do LSH parameters, such as the number of hash bits and bands, affect relative-finding precision and recall?
Does removing common SNPs change the minimum marker set needed to detect first-degree relatives?
To what extent does ancestry group affect kinship search performance under the same SNP budget?
Which similarity threshold best separates true relatives from unrelated pairs in a public dataset?
How does the runtime of LSH compare with all-pairs comparison as the sample size grows?

Basic Materials

Computer with enough storage for genotype files.
Spreadsheet software or a notebook environment for organizing metadata.
Python with NumPy, pandas, SciPy, and scikit-learn.
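Public genotype data from the 1000 Genomes Project or a mirrored academic source. Choose a release that includes related samples with pedigree labels, since you need known relatives as ground truth.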
PubMed access for background papers on forensic kinship and SNP privacy.
Basic reference notes on ethics and human-subject data rules.

Advanced Materials

University server or workstation with more memory for large genotype matrices.
Python packages for genomic data handling, such as scikit-allel or pysam.
Jupyter Notebook for reproducible analysis.
Version control software, such as Git.
Access to a secure data environment approved for handling human genotype datasets.
Visualization tools for ROC curves, precision-recall curves, and runtime plots.

Software & Tools

Python: Runs your similarity search, data cleaning, and evaluation code.
Jupyter Notebook: Keeps your analysis, plots, and notes in one reproducible file.
scikit-learn: Provides nearest-neighbor search and evaluation metrics such as precision, recall, and ROC curves.
pandas: Organizes genotype tables, sample metadata, and result summaries.

Experiment Steps

Define the privacy question you want to test, such as the smallest SNP subset that still identifies relatives with useful accuracy.
Select a public genotype dataset and decide which relationship types you will treat as true matches.
Build a baseline similarity method, then add LSH so you can compare speed and retrieval quality.
Choose the metrics that matter most, such as recall, precision, false match rate, and runtime.
Plan a marker-selection strategy so you can test different SNP counts and different SNP subsets fairly.
Design controls that separate real kinship signal from ancestry similarity and random matches.
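Before touching real data, the steps above can be rehearsed on simulated families. The sketch below assumes a very simplified model: independent SNPs, allele frequencies drawn uniformly between 0.05 and 0.5, a single population, and no genotyping error. It compares mother-child pairs (related) with mother-father pairs (unrelated) at several SNP budgets, using identity-by-state similarity as a stand-in baseline. All function names and parameter values are illustrative.

```python
import numpy as np


def simulate_trios(n_trios, n_snps, rng):
    """Simulate mother, father, and child genotypes coded 0/1/2."""
    freq = rng.uniform(0.05, 0.5, size=n_snps)
    mother = rng.binomial(2, freq, size=(n_trios, n_snps))
    father = rng.binomial(2, freq, size=(n_trios, n_snps))
    # Each parent passes on one allele; a heterozygote passes
    # the alternate allele with probability 1/2.
    child = rng.binomial(1, mother / 2) + rng.binomial(1, father / 2)
    return mother, father, child


def ibs_scores(a, b):
    """Row-wise identity-by-state similarity in [0, 1]."""
    return 1.0 - np.abs(a - b).mean(axis=1) / 2.0


def auc(related, unrelated):
    """Probability that a related pair outscores an unrelated pair."""
    wins = (related[:, None] > unrelated[None, :]).mean()
    ties = (related[:, None] == unrelated[None, :]).mean()
    return float(wins + 0.5 * ties)


rng = np.random.default_rng(42)
mother, father, child = simulate_trios(n_trios=100, n_snps=2000, rng=rng)

results = {}
for budget in (10, 50, 200, 1000, 2000):
    # Draw a random SNP subset of the requested size.
    snps = rng.choice(2000, size=budget, replace=False)
    related = ibs_scores(mother[:, snps], child[:, snps])
    unrelated = ibs_scores(mother[:, snps], father[:, snps])
    results[budget] = auc(related, unrelated)
    print(f"{budget:>5} SNPs  AUC = {results[budget]:.3f}")
```

Real genotypes have linkage between nearby SNPs, population structure, and missing calls, so treat this only as a way to debug your pipeline and metrics before running them on the public dataset.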

Common Pitfalls

Using too few samples, especially too few known relative pairs, which makes accuracy estimates noisy and can make kinship detection look better or worse than it really is.
Mixing up relatives and unrelated pairs in the labels, which breaks your evaluation.
Comparing SNP sets of different quality without controlling for missing data, which skews the minimum marker estimate.
Ignoring ancestry structure, which can make the model seem to find kin when it only finds shared population patterns.
Reporting only one threshold, which hides how quickly false positives rise as you lower the SNP count.
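To avoid the single-threshold pitfall, you can tabulate precision, recall, and false positive rate across several cutoffs. The sketch below is a minimal example: the function name is hypothetical, and the scores and labels are invented numbers, not results from any dataset.

```python
import numpy as np


def sweep_thresholds(scores, is_related, thresholds):
    """Return precision, recall, and false positive rate per threshold."""
    scores = np.asarray(scores, dtype=float)
    is_related = np.asarray(is_related, dtype=bool)
    rows = []
    for t in thresholds:
        # A pair is called "related" when its score reaches the threshold.
        called = scores >= t
        tp = int(np.sum(called & is_related))
        fp = int(np.sum(called & ~is_related))
        fn = int(np.sum(~called & is_related))
        tn = int(np.sum(~called & ~is_related))
        rows.append({
            "threshold": t,
            "precision": tp / (tp + fp) if tp + fp else float("nan"),
            "recall": tp / (tp + fn) if tp + fn else float("nan"),
            "false_positive_rate": fp / (fp + tn) if fp + tn else float("nan"),
        })
    return rows


# Invented similarity scores and ground-truth labels for eight pairs.
scores = [0.82, 0.79, 0.74, 0.70, 0.69, 0.66, 0.65, 0.61]
is_related = [True, True, False, True, False, False, False, False]

table = sweep_thresholds(scores, is_related, thresholds=[0.75, 0.70, 0.65])
for row in table:
    print(row)
```

Running the same sweep at each SNP budget shows how the usable threshold range narrows as markers are removed.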

What Makes This Competitive

A stronger project will not just ask whether kinship search works. It will measure how privacy risk changes across SNP budgets, ancestry groups, and search settings. Strong entries also compare multiple metrics, not just one accuracy number. If you add a careful fairness or policy analysis, your work becomes much more useful than a simple code demo.

Project Variations

Test whether rare SNPs or common SNPs carry more kinship signal in public genotype data.
Compare LSH with exact nearest-neighbor search to see how much speed you gain and what accuracy you lose.
Repeat the analysis on different relationship types, such as parent-child, siblings, and distant relatives, to see where privacy risk drops off.
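For the rare-versus-common variation, you first need to sort SNPs by minor allele frequency (MAF). The sketch below assumes a genotype matrix with people in rows and SNPs in columns, coded 0/1/2, with no missing calls. The 0.05 cutoff is a common convention used here as an assumption, and the function names and toy matrix are illustrative.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """Per-SNP minor allele frequency from a people-by-SNPs 0/1/2 matrix."""
    alt_freq = np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def split_by_maf(genotypes, cutoff=0.05):
    """Return column indices of rare and common SNPs.

    SNPs with no variation (MAF of 0) are dropped from both groups.
    """
    maf = minor_allele_frequency(genotypes)
    rare = np.flatnonzero((maf > 0) & (maf < cutoff))
    common = np.flatnonzero(maf >= cutoff)
    return rare, common


# Toy matrix: 20 people by 4 SNPs, invented for illustration.
toy = np.zeros((20, 4), dtype=int)
toy[:, 1] = 1   # everyone heterozygous: MAF 0.5
toy[:, 2] = 2   # everyone homozygous alternate: no variation
toy[0, 3] = 1   # a single alternate allele: MAF 0.025

rare, common = split_by_maf(toy)
print("MAF per SNP:", minor_allele_frequency(toy))
print("rare SNP columns:  ", rare)
print("common SNP columns:", common)
```

To keep the comparison fair, draw the same number of SNPs from each group before measuring kinship accuracy.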

Learn More

1000 Genomes Project: Search the project site for public genotype data and sample metadata.
NCBI PubMed: Search for review articles on forensic genomics, kinship analysis, and SNP privacy.
NIH MedlinePlus Genetics: Read plain-language background on SNPs, inheritance, and DNA variation.
NCBI Bookshelf: Find free textbooks and chapters on genetics, bioinformatics, and sequence analysis.
MIT OpenCourseWare: Search for machine learning and algorithms courses that cover hashing and similarity search.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

