Detecting Human Positive Selection in Genome Data

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Evolutionary Biology · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Some human traits spread fast because they helped people survive. Lactase persistence, skin pigmentation, and EDAR variants are classic examples. With public genome data, you can test those signals yourself instead of just reading about them. Your project can turn evolution into numbers.

What Is It?

This project looks for signs of recent positive selection, which means a DNA variant became more common because it helped people survive or reproduce. You are not guessing from a story. You are measuring patterns in population data, such as how often a variant appears and how nearby DNA looks around it.

Think of the genome like a neighborhood map. If one house gets upgraded and suddenly all the nearby houses change with it, that can leave a pattern behind. In genetics, a favored variant can drag nearby DNA along as it spreads (this is called genetic hitchhiking), and that leaves a signal called a selective sweep. The signal is fainter when the sweep is partial or soft, meaning the variant has not yet reached high frequency or rose on more than one genetic background. Your job is to compare that signal at known loci, like the lactase gene (LCT), EDAR, and skin-pigmentation genes such as SLC24A5, across human populations.
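To make the "nearby DNA looks the same" idea concrete, here is a minimal sketch of extended haplotype homozygosity (EHH), one common way to measure it. It asks: of all pairs of chromosomes that carry the same allele at the core site, what fraction are still identical as you walk away from that site? The haplotypes below are invented toy data, and the function name `ehh` is just an illustrative choice. Real tools such as selscan or scikit-allel handle phasing, genetic distance, and normalization that this sketch ignores.

```python
from collections import Counter
from math import comb

def ehh(haplotypes, core_index, core_allele, end_index):
    """Fraction of carrier pairs that are identical from the core site to end_index."""
    # Keep only chromosomes that carry the chosen allele at the core site.
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return float("nan")
    lo, hi = sorted((core_index, end_index))
    # Group carriers by their exact sequence across the interval.
    groups = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(n, 2)

# Toy phased haplotypes (rows = chromosomes, columns = sites; 1 = derived allele).
# These values are made up for illustration, not taken from any dataset.
toy_haplotypes = [
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
]
CORE = 2  # index of the hypothetical favored site

derived_decay = [ehh(toy_haplotypes, CORE, 1, end) for end in range(CORE, 5)]
ancestral_decay = [ehh(toy_haplotypes, CORE, 0, end) for end in range(CORE, 5)]
print("derived:  ", derived_decay)
print("ancestral:", ancestral_decay)
```

In this toy example the derived allele keeps its homozygosity farther from the core than the ancestral allele does, which is the pattern a recent sweep tends to leave.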

A composite selection-density statistic would combine several clues into one score. That score might use allele frequency, population differentiation, and local haplotype structure. You then test whether your score highlights known selected loci better than simpler measures do.
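One possible way to build such a score, sketched here under simple assumptions, is to put each clue on the same scale with z-scores and then take a weighted average per genome window. The component names, the equal default weights, and the numbers are all hypothetical. Your own statistic may combine its inputs differently.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list so it has mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def composite_score(components, weights=None):
    """Weighted average of z-scored components, one score per window.

    components: dict mapping a component name to a list of per-window values.
    weights: optional dict of component weights (assumption: equal if omitted).
    """
    names = list(components)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    standardized = {name: zscores(components[name]) for name in names}
    n_windows = len(next(iter(components.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in names) / total_weight
        for i in range(n_windows)
    ]

# Made-up values for five windows; window index 2 plays the "selected" locus.
toy_windows = {
    "freq_difference": [0.05, 0.10, 0.80, 0.07, 0.12],
    "fst": [0.02, 0.05, 0.60, 0.03, 0.04],
    "haplotype_homozygosity": [0.10, 0.15, 0.90, 0.12, 0.20],
}

scores = composite_score(toy_windows)
top_window = max(range(len(scores)), key=scores.__getitem__)
print("scores:", [round(s, 2) for s in scores])
print("highest-scoring window:", top_window)
```

Z-scoring keeps one component from dominating just because its units are larger. The weights are where you would later test which components matter most.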

Why This Is a Good Topic

This is a strong science fair topic because the question is clear, measurable, and tied to real human biology. You can use public datasets, define your own scoring method, and compare known loci against neutral regions. The real-world link is human adaptation, diet, and skin biology, which makes the story easy to explain. You can learn population genetics, data cleaning, statistics, and how to judge whether a signal is real or just noise.

Research Questions

How does the composite selection-density statistic compare with single-metric tests at identifying known human selection loci?
What is the effect of population choice on the selection signal at lactase, EDAR, and pigmentation loci?
Does the selection-density score separate known selected regions from matched neutral regions better than expected by chance?
To what extent do allele-frequency differences between continents predict the selection-density score at each locus?
Which components of the composite statistic contribute most to detecting selection in each gene region?
How does window size around a locus change the strength of the selection-density signal?
What is the effect of using gnomAD allele frequencies versus 1000 Genomes genotype data on the ranking of candidate selected regions?
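Two of the questions above, the one on allele-frequency differences and the one on window size, can be explored with a small sketch like this. It uses the textbook form of Wright's F_ST for two populations, (H_T - H_S) / H_T, and sums the numerator and denominator over all sites in a window before dividing. This simple version ignores sample-size corrections that estimators in packages such as scikit-allel apply, and every position and frequency below is invented.

```python
def fst_terms(p1, p2):
    """Return (numerator, denominator) of Wright's F_ST for one biallelic site."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # heterozygosity if pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return h_total - h_within, h_total

def windowed_fst(sites, center, half_width):
    """F_ST over sites within half_width of center, as a ratio of summed terms.

    sites: list of (position, freq_in_pop1, freq_in_pop2) tuples.
    """
    numerator = denominator = 0.0
    for position, p1, p2 in sites:
        if abs(position - center) <= half_width:
            num, den = fst_terms(p1, p2)
            numerator += num
            denominator += den
    return numerator / denominator if denominator > 0 else float("nan")

# Hypothetical sites: the one at position 5000 is strongly differentiated.
toy_sites = [
    (1000, 0.50, 0.50),
    (4000, 0.40, 0.50),
    (5000, 0.10, 0.90),
    (6000, 0.30, 0.45),
    (9000, 0.20, 0.20),
]

by_window = {
    half_width: windowed_fst(toy_sites, center=5000, half_width=half_width)
    for half_width in (0, 1000, 4000)
}
for half_width, value in by_window.items():
    print(f"half-width {half_width:>5}: F_ST = {value:.3f}")
```

In this toy example the signal shrinks as the window widens, because less differentiated neighbors dilute the focal site. That is the trade-off the window-size question asks you to measure.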

Basic Materials

Computer with internet access and enough storage for genomic data files.
Spreadsheet software for organizing variant summaries.
Command-line access on a laptop or school computer.
R or Python installed locally.
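Public variant data from gnomAD (aggregate allele frequencies by population) and 1000 Genomes (phased individual-level genotypes, which haplotype-based statistics require).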
Reference genome annotation files for locating genes and nearby windows.
Basic statistics reference for z-scores, p-values, and correlation.

Advanced Materials

High-memory workstation or university compute cluster.
Python with pandas, numpy, scipy, and matplotlib.
R with the tidyverse (which includes ggplot2) and the built-in stats package.
VCF processing tools such as bcftools and PLINK.
Haplotype and selection tools such as selscan or scikit-allel.
Genome browser tracks and annotation files for target loci.
Access to published recombination maps for local background correction.
Version control with Git for tracking analysis changes.

Software & Tools

Python: Cleans variant tables, calculates summary statistics, and plots selection scores.
R: Tests whether your composite statistic separates selected and neutral regions.
IGV (Integrative Genomics Viewer): Lets you inspect local variant patterns around candidate loci.
UCSC Genome Browser: Helps you compare gene locations, nearby windows, and annotations.
GitHub Desktop: Tracks analysis versions so you can see how your score changes.

Experiment Steps

Define the biological question and pick a small set of loci that already have strong selection evidence.
Decide which public datasets, populations, and genome windows you will compare.
Design your composite selection-density statistic so each input feature has a clear meaning.
Plan your neutral control regions by matching gene length, allele frequency range, and recombination context.
Build a validation strategy that checks whether your score recovers known selected loci above background.
Choose the plots and statistical tests you will use to compare loci, populations, and control regions.
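The control-matching and validation steps above could be sketched like this: pick control regions whose length and recombination rate fall within a tolerance of the candidate, then ask what share of those controls score at least as high. The field names, the 20% tolerances, and all values are assumptions for illustration only.

```python
def matched_controls(candidate, pool, length_tol=0.2, recomb_tol=0.2):
    """Keep pool regions within a relative tolerance of the candidate's properties."""
    def close(value, target, tol):
        return abs(value - target) <= tol * target

    return [
        region for region in pool
        if close(region["length"], candidate["length"], length_tol)
        and close(region["recomb_rate"], candidate["recomb_rate"], recomb_tol)
    ]

def empirical_p_value(candidate_score, control_scores):
    """One-sided empirical p-value with the usual +1 correction."""
    at_least_as_high = sum(1 for s in control_scores if s >= candidate_score)
    return (at_least_as_high + 1) / (len(control_scores) + 1)

# Hypothetical candidate and control pool (lengths in bp, rates in cM/Mb).
candidate = {"length": 50000, "recomb_rate": 1.0, "score": 2.5}
pool = [
    {"length": 52000, "recomb_rate": 1.10, "score": 0.3},
    {"length": 48000, "recomb_rate": 0.90, "score": -0.4},
    {"length": 120000, "recomb_rate": 1.00, "score": 3.1},  # too long, excluded
    {"length": 50000, "recomb_rate": 3.00, "score": 2.9},   # recombination mismatch
    {"length": 45000, "recomb_rate": 1.05, "score": 2.7},
    {"length": 55000, "recomb_rate": 0.95, "score": 0.1},
]

controls = matched_controls(candidate, pool)
p_value = empirical_p_value(candidate["score"], [r["score"] for r in controls])
print(f"{len(controls)} matched controls, empirical p = {p_value:.2f}")
```

With only a handful of controls the smallest possible p-value is large, so a real analysis would need hundreds or thousands of matched regions per candidate.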

Common Pitfalls

Mixing populations with very different ancestry backgrounds, which can make frequency differences caused by drift and demographic history look like selection.
Using a score that favors common variants everywhere, which makes the statistic look strong even for neutral regions.
Comparing candidate loci to random control regions that do not match gene length or recombination rate.
Ignoring missing data or low-quality variants, which can distort local selection signals.
Treating one famous locus as proof that the whole statistic works, which leaves no real validation of the method.
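For the missing-data pitfall in particular, a simple guard is to drop variants with too many uncalled genotypes before computing any frequency. The sketch below assumes genotypes are stored as alternate-allele counts (0, 1, or 2) with None for a missing call, and it uses a hypothetical 90% call-rate cutoff. Tools such as bcftools and PLINK offer their own filters for real VCF data.

```python
MIN_CALL_RATE = 0.9  # assumed threshold; choose and justify your own

def call_rate(genotypes):
    """Share of individuals with a non-missing genotype."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)

def alt_allele_frequency(genotypes):
    """Alternate-allele frequency among called diploid genotypes only."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(called) / (2 * len(called))

# Made-up genotypes for ten individuals at three variants.
toy_genotypes = {
    "var_a": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],
    "var_b": [0, None, None, 1, 0, None, 2, 0, 1, 1],  # 70% called, dropped
    "var_c": [2, 2, None, 2, 1, 2, 2, 2, 2, 1],
}

kept = {
    name: alt_allele_frequency(genotypes)
    for name, genotypes in toy_genotypes.items()
    if call_rate(genotypes) >= MIN_CALL_RATE
}
for name, freq in kept.items():
    print(f"{name}: alt allele frequency = {freq:.3f}")
```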

What Makes This Competitive

A stronger project would not stop at a single score. You would test whether your statistic beats simpler ones across multiple populations, loci, and matched controls. You could also show where it fails, which makes the method more honest and more useful. The best versions add sensitivity analysis, effect-size estimates, and clear validation on known biology.
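One effect-size estimate that fits this comparison is the area under the ROC curve (AUC): the probability that a randomly chosen known locus scores higher than a randomly chosen control, with ties counted as half. Here is a minimal sketch comparing a composite score with a single metric. All scores are invented, and the function name `auc` is just illustrative.

```python
def auc(selected_scores, control_scores):
    """Probability that a selected region outscores a control (ties count 0.5)."""
    wins = 0.0
    for s in selected_scores:
        for c in control_scores:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(selected_scores) * len(control_scores))

# Hypothetical scores for three known loci and four matched controls.
composite_selected = [2.1, 1.8, 2.6]
composite_controls = [0.2, -0.5, 1.9, 0.0]

single_selected = [0.6, 0.2, 0.7]
single_controls = [0.1, 0.3, 0.65, 0.05]

composite_auc = auc(composite_selected, composite_controls)
single_auc = auc(single_selected, single_controls)
print(f"composite AUC = {composite_auc:.3f}, single-metric AUC = {single_auc:.3f}")
```

An AUC of 0.5 means the score does no better than chance and 1.0 means perfect separation. Reporting it per population also shows where the statistic fails.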

Project Variations

Compare selection-density signals in diet-related genes versus pigmentation genes across more than one ancestry group.
Swap the composite score for a machine-learning classifier and test whether it ranks known loci better.
Expand the analysis to archaic introgression regions or other adaptive loci, then ask whether the same signal pattern appears.

Learn More

1000 Genomes Project: Search the project site for population genotype data, variant files, and background materials on global human variation.
gnomAD: Use the browser and downloads page to explore allele frequencies and gene constraint metrics across populations.
MedlinePlus Genetics (formerly NIH Genetics Home Reference): Read plain-language overviews of genes such as LCT (lactase) and EDAR, then connect them to phenotype.
NCBI Bookshelf: Search for free textbook chapters on population genetics, selection, and haplotype structure.
PubMed: Search review articles on recent positive selection, selective sweeps, and human adaptation.
UCSC Genome Browser: Inspect genomic neighborhoods, annotations, and reference tracks for your candidate loci.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →