Wikipedia Biography Network Bias
ISEF Category: Behavioral and Social Sciences
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
Subcategory: Sociology and Anthropology · Difficulty: Intermediate · Setup: Home Setup · Time: 1 to 2 Months
The Hook
A biography page can look neutral and still sit far from the rest of the network. That matters, because link structure helps readers find people in the first place. You can measure those hidden routes and test whether a rewiring rule shrinks the gap.
What Is It?
This project treats Wikipedia like a map of roads. Each biography page is a stop, and each hyperlink is a road that can send readers to another page. If some groups sit in busy hubs and others stay on quiet side streets, the network itself may help spread or hide attention.
You are not asking whether a page exists. You are asking where that page sits in the web of links, and whether that position differs by gender or nationality. A counterfactual rewiring test asks a simple question: what would the network look like if the same pages had their links arranged differently while the graph kept the same overall size and shape?
Why This Is a Good Topic
This is a strong science fair topic because you can turn a social question into numbers. You can test clear variables, compare groups, and use graph measures like degree, betweenness, and path length. It connects to bias in online information systems, and you can do the full project with public data and a home computer.
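To make the graph measures named above concrete, here is a minimal sketch on a tiny made-up graph. The page names, the group labels "A" and "B", the hub page, and the helper `biography_metrics` are all hypothetical placeholders, not real Wikipedia data.

```python
import networkx as nx
import pandas as pd

# Toy directed link graph: an edge A -> B means page A links to page B.
# All page names are invented for illustration.
edges = [
    ("Hub", "Ada"), ("Hub", "Ben"), ("Ada", "Hub"), ("Ben", "Hub"),
    ("Ben", "Ada"), ("Ada", "Cy"), ("Cy", "Dee"),
]
G = nx.DiGraph(edges)

# Hypothetical label table: one row per biography page.
labels = pd.DataFrame(
    {"page": ["Ada", "Ben", "Cy", "Dee"], "group": ["A", "A", "B", "B"]}
).set_index("page")


def biography_metrics(graph, labels, hub):
    """Return degree, betweenness, and distance-to-hub for each labeled page."""
    betweenness = nx.betweenness_centrality(graph)
    # With only `target` given, NetworkX returns {source: distance to target}.
    # Pages with no route to the hub are simply missing from the dict.
    dist_to_hub = nx.shortest_path_length(graph, target=hub)
    rows = []
    for page, group in labels["group"].items():
        rows.append({
            "page": page,
            "group": group,
            "in_degree": graph.in_degree(page),
            "out_degree": graph.out_degree(page),
            "betweenness": betweenness[page],
            "dist_to_hub": dist_to_hub.get(page, float("nan")),
        })
    return pd.DataFrame(rows).set_index("page")


metrics = biography_metrics(G, labels, hub="Hub")
group_means = metrics.groupby("group")[
    ["in_degree", "betweenness", "dist_to_hub"]
].mean()
print(group_means)
```

In this toy graph, the group B pages cannot reach the hub at all, so their distance is recorded as missing rather than as a large number. Decide in advance how your project treats unreachable pages, because that choice changes the group averages.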
Research Questions
- How does average degree differ between biography pages for women and men within the same nationality group?
- What is the effect of nationality on the average shortest-path distance from biography pages to high-traffic hub pages?
- Does counterfactual rewiring reduce the gap in betweenness centrality between gender groups?
- To what extent do article age and page length explain differences in link centrality across nationality groups?
- Which network metric best separates biographies in high-link groups from those in low-link groups?
- How does changing the definition of the biography sample alter the measured under-representation pathway?
Basic Materials
- Laptop with internet access.
- Spreadsheet software such as Google Sheets or Excel.
- Python with NetworkX, Pandas, and Matplotlib.
- Free Wikipedia page lists or Wikimedia dump files.
- A notebook for logging labels and graph rules.
Advanced Materials
- Local copy of Wikimedia dumps with enough storage for the full graph.
- Python with NetworkX, igraph, Pandas, and SciPy.
- Gephi for visual checks on network structure.
- R with tidyverse and igraph for alternate analysis.
- A label file that maps each biography to gender and nationality.
Software & Tools
- Python: Builds the graph, calculates metrics, and runs the null models.
- NetworkX: Computes centrality, shortest paths, and community measures.
- Gephi: Lets you inspect the network visually and spot structural clusters.
- Pandas: Cleans labels and merges article metadata into one table.
- MediaWiki API: Pulls article links and metadata directly from Wikipedia.
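As a rough sketch of the link-pulling step, the snippet below builds query parameters for the `prop=links` module of the MediaWiki Action API and turns a response into edge pairs. The helper names `links_query_params` and `parse_links` are hypothetical, and the `sample` response is hand-written for illustration, not real API output. Check the parameter names against the API help pages before relying on them.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def links_query_params(title, plcontinue=None):
    """Build query parameters asking for the article links on one page."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,      # article namespace only
        "pllimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    if plcontinue:
        # Long pages return links in batches; this token requests the next one.
        params["plcontinue"] = plcontinue
    return params


def parse_links(response_json):
    """Return ([(source, target), ...], next_token_or_None) from one response."""
    edges = []
    for page in response_json.get("query", {}).get("pages", []):
        source = page["title"]
        for link in page.get("links", []):
            edges.append((source, link["title"]))
    next_token = response_json.get("continue", {}).get("plcontinue")
    return edges, next_token


# Hand-written example shaped like a formatversion=2 response (not real data).
sample = {
    "continue": {"plcontinue": "123|0|Example", "continue": "||"},
    "query": {
        "pages": [
            {
                "pageid": 1,
                "ns": 0,
                "title": "Example Person",
                "links": [
                    {"ns": 0, "title": "Example City"},
                    {"ns": 0, "title": "Example University"},
                ],
            }
        ]
    },
}

edges, next_token = parse_links(sample)
print(edges, next_token)
```

In a real run you would send these parameters with an HTTP library such as `requests`, for example `requests.get(API_URL, params=links_query_params(title)).json()`, and keep requesting with the returned token until no token comes back. For a sample of more than a few thousand pages, the dump files are likely the better route.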
Experiment Steps
- Define the biography sample and lock your gender and nationality labels.
- Build the link graph and decide which pages, links, and redirects to exclude.
- Choose the network measures that match your question, such as degree, betweenness, and path length.
- Set a counterfactual rewiring rule that keeps the number of pages and links fixed while changing which pages the links connect.
- Compare the observed graph with the rewired graph using the same metrics and a simple statistical test.
- Check whether your results stay similar when you change sample rules or label sources.
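The rewiring and comparison steps above can be sketched as follows. This assumes one particular rewiring rule, a degree-preserving edge swap, which is a common choice but not necessarily the rule your project should use. The toy graph, the group lists, and the helpers `rewire_preserving_degrees` and `mean_gap` are hypothetical. Because this rule keeps every page's in- and out-degree fixed, the sketch compares betweenness, which can change under the swap, rather than degree, which cannot.

```python
import random
import networkx as nx


def rewire_preserving_degrees(graph, n_swaps, seed=None):
    """Return a rewired copy in which every page keeps its in- and out-degree."""
    rng = random.Random(seed)
    g = graph.copy()
    edges = list(g.edges())
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on very constrained graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed swap: a->b and c->d become a->d and c->b.
        # Skip swaps that would create a self-loop or a duplicate link.
        if a == d or c == b or g.has_edge(a, d) or g.has_edge(c, b):
            continue
        g.remove_edges_from([(a, b), (c, d)])
        g.add_edges_from([(a, d), (c, b)])
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return g


def mean_gap(graph, group_a, group_b):
    """Mean betweenness of group A minus mean betweenness of group B."""
    bc = nx.betweenness_centrality(graph)
    mean_a = sum(bc[n] for n in group_a) / len(group_a)
    mean_b = sum(bc[n] for n in group_b) / len(group_b)
    return mean_a - mean_b


# Toy graph: group A pages link densely to each other, group B pages sit on
# the edge. Node numbers stand in for biography pages.
group_a = list(range(6))
group_b = list(range(6, 12))
G = nx.DiGraph()
G.add_edges_from(
    (u, v) for u in group_a for v in group_a if u != v and (u + v) % 2 == 1
)
for k, b in enumerate(group_b):
    G.add_edge(b, group_a[k])
    G.add_edge(group_a[k], group_b[(k + 1) % len(group_b)])

observed_gap = mean_gap(G, group_a, group_b)

# Build several rewired graphs and measure the same gap in each one.
n_null = 20
null_gaps = [
    mean_gap(
        rewire_preserving_degrees(G, n_swaps=10 * G.number_of_edges(), seed=s),
        group_a,
        group_b,
    )
    for s in range(n_null)
]

# Share of rewired graphs whose gap is at least as large as the observed one.
extreme = sum(abs(gap) >= abs(observed_gap) for gap in null_gaps)
p_value = (1 + extreme) / (1 + n_null)
print(f"observed gap: {observed_gap:.4f}, empirical p-value: {p_value:.3f}")
```

Twenty rewired graphs is only enough for a demonstration. A real analysis would likely use hundreds, and betweenness on a large graph is slow, so you may need the `k` sampling argument of `nx.betweenness_centrality` or a faster library such as igraph.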
Common Pitfalls
- Mixing biography pages with non-biography pages, which inflates link counts and blurs group comparisons.
- Using inconsistent gender labels across sources, which makes group sizes and centrality scores hard to trust.
- Comparing raw link counts without normalizing for page age or page length, which favors older, longer biographies.
- Treating nationality as a single clean category when some biographies have dual or changing national identities.
- Rewiring links without preserving key graph properties, which makes the counterfactual comparison meaningless.
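One way to address the normalization pitfall above is to remove the part of a page's link count that page length and age can predict, then compare what is left over. The sketch below uses an ordinary least-squares fit for that. The table values are invented for illustration, and the column names and the helper `adjusted_degree` are hypothetical. The log transforms are an assumption that you should check against your own data.

```python
import numpy as np
import pandas as pd

# Made-up example rows; replace with your own merged metadata table.
pages = pd.DataFrame({
    "page": ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"],
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "in_degree": [40, 12, 75, 8, 20, 5, 33, 3],
    "length_bytes": [52000, 9000, 88000, 7000, 41000, 6000, 60000, 4000],
    "age_years": [18, 6, 20, 4, 15, 5, 17, 2],
})


def adjusted_degree(df):
    """Add an 'adjusted' column: log link count with length and age regressed out."""
    y = np.log1p(df["in_degree"].to_numpy(dtype=float))
    X = np.column_stack([
        np.ones(len(df)),                                   # intercept
        np.log(df["length_bytes"].to_numpy(dtype=float)),   # page length
        df["age_years"].to_numpy(dtype=float),              # article age
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    return df.assign(adjusted=residual)


adjusted = adjusted_degree(pages)
print(adjusted.groupby("group")["adjusted"].mean())
```

A positive adjusted value means a page has more incoming links than its length and age would predict, and a negative value means fewer. Comparing group means of this column is a fairer test than comparing raw link counts, though it still cannot show why a gap exists.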
What Makes This Competitive
A strong version of this project does more than report a gap. It compares several null models, not just one rewiring rule, and asks whether the pattern survives controls for article age, page length, and network size. You can also test whether one metric tells a different story than another, which makes the analysis more useful than a simple count.
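As one example of a second null model, the sketch below shuffles the group labels instead of the links. The graph stays exactly as observed, and the test asks how often a random assignment of labels produces a gap as large as the real one. The function name `label_shuffle_null` and the example values are hypothetical.

```python
import numpy as np


def label_shuffle_null(values, is_group_a, n_shuffles=1000, seed=0):
    """Compare the observed group gap with gaps under randomly shuffled labels."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_group_a = np.asarray(is_group_a, dtype=bool)
    observed = values[is_group_a].mean() - values[~is_group_a].mean()
    null = np.empty(n_shuffles)
    for k in range(n_shuffles):
        # Same group sizes, but labels are assigned to pages at random.
        shuffled = rng.permutation(is_group_a)
        null[k] = values[shuffled].mean() - values[~shuffled].mean()
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_shuffles)
    return observed, null, p_value


# Invented metric values for eight pages; the first four belong to group A.
values = [5, 6, 7, 8, 1, 2, 3, 4]
is_group_a = [True, True, True, True, False, False, False, False]

observed, null, p_value = label_shuffle_null(values, is_group_a)
print(f"observed gap: {observed}, empirical p-value: {p_value:.3f}")
```

The two null models answer different questions. Shuffling labels tests whether group membership is related to a metric at all. Rewiring links tests whether the gap depends on the specific arrangement of links beyond what simpler properties, such as link counts, already explain. Reporting both makes it clearer which explanation the data support.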
Project Variations
- Repeat the network analysis on biographies of women, men, and nonbinary people from one language edition to see whether the pattern changes across label groups.
- Compare English Wikipedia with another language edition to test whether nationality gaps grow or shrink across editorial cultures.
- Swap hyperlink structure for category links or reference links to see whether the under-representation pathway appears in a different graph layer.
Learn More
- Wikimedia Dumps: Download article and link data from the Wikimedia data portal.
- Wikipedia API documentation: Learn how to pull page links and article metadata from the MediaWiki API help pages.
- NetworkX documentation: Find graph metrics and worked examples in the official NetworkX docs.
- Gephi tutorials: Build and inspect networks with the free Gephi software and its help center.
- MIT OpenCourseWare: Search for free graph theory and network science lecture notes.