Community Detection Thresholds in Network Data

ISEF Category: Mathematics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Probability and Statistics · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Big networks can hide clear groups until the data gets just noisy enough to break them apart. That breaking point is called a phase transition, and it shows up in everything from social networks to biology. You can study where that line sits, then test whether real networks behave the way the theory predicts. This is a math project with real data and real stakes.

What Is It?

Community detection tries to find groups inside a network. Think of a school cafeteria, where clusters form around clubs, sports, or shared classes. A stochastic block model is a math way to build fake networks with known groups, so you can test whether an algorithm can recover them. Degree correction adds a second layer, because real networks do not give every person the same number of connections. Some nodes are hubs, and some are quiet. That extra variation makes the problem closer to real life, but harder to solve.

A phase transition is the tipping point where a method goes from mostly failing to mostly working. Before that point, the groups look too mixed up. After that point, the hidden structure becomes detectable. In this project, you study that boundary and ask whether a new information-theoretic threshold, the best possible limit set by math, matches what you see in simulations and real co-authorship networks from arXiv metadata.

Why This Is a Good Topic

This is a strong science fair topic because it starts with a clear yes-or-no question, then grows into real statistical modeling. You can test many settings by changing noise, group size, and degree imbalance, so the project stays measurable. It connects to network science, recommendation systems, social media analysis, and collaboration patterns in research. You can learn how to simulate data, compare algorithms, and read phase diagrams, which are all useful research skills.

Research Questions

How does increasing degree heterogeneity change the threshold for successful community detection?
What is the effect of group size imbalance on recovery accuracy in degree-corrected stochastic block models?
Does the detected phase transition match the predicted information-theoretic threshold across different noise levels?
To what extent do different clustering algorithms agree near the recovery boundary?
Which network statistics best predict whether a co-authorship graph has detectable community structure?
How does adding more weak ties between groups change the error rate of community recovery?

Basic Materials

Laptop with enough memory to run network simulations.
Python installed with NumPy, SciPy, NetworkX, and matplotlib.
Access to arXiv metadata or another public network dataset.
Spreadsheet software for tracking simulation settings and results.
Graph paper or notebook for planning variables and controls.
Basic statistics reference notes for accuracy, error rate, and confidence intervals.

Advanced Materials

Access to a university computing cluster or a strong workstation.
Python with scikit-learn, NetworkX, pandas, matplotlib, and seaborn.
Network analysis package such as graph-tool or igraph.
A public arXiv co-authorship dataset, cleaned and versioned.
Statistical testing tools for permutation tests, bootstrap intervals, and model comparison.
Documentation for spectral clustering, likelihood-based inference, and degree-corrected block models.

Software & Tools

Python: Runs simulations, fits network models, and computes recovery metrics.
Jupyter Notebook: Keeps code, notes, and plots in one place.
NetworkX: Builds synthetic graphs and explores network structure.
pandas: Organizes simulation outputs and metadata tables.
matplotlib: Makes threshold curves, phase plots, and comparison graphs.

Experiment Steps

Define the exact recovery task you want to study, such as detecting two groups, several groups, or unequal group sizes.
Choose the network model pieces you will vary, including noise level, degree correction, and group imbalance.
Plan the success metric you will use, such as accuracy, adjusted Rand index, or likelihood gap.
Build a simulation grid so you can map when recovery starts to work and when it fails.
Compare the empirical threshold from your simulations with the theoretical threshold from the literature.
Test the same framework on real co-authorship networks and check whether the pattern still holds.

Common Pitfalls

Using only one random seed, which makes a noisy phase transition look cleaner than it really is.
Changing several network parameters at once, which hides the effect of degree correction.
Measuring success with raw accuracy only, which can mislead you when groups are unbalanced.
Comparing theory to real data without cleaning the network, which mixes data issues with model failure.
Ignoring finite-size effects, which can shift the apparent threshold in smaller simulations.

What Makes This Competitive

A class-level version of this project only shows a few simulations. A stronger version maps the threshold across many settings and explains why the boundary moves. You can raise the level by comparing multiple recovery metrics, testing finite-size effects, and checking whether real networks match the model in the same way as synthetic ones. A careful uncertainty analysis matters just as much as the final plot.

Project Variations

Study phase transitions for community detection in weighted citation networks instead of unweighted graphs.
Compare degree-corrected and standard stochastic block models on real collaboration data.
Test whether overlapping communities change the recovery threshold in synthetic networks.

Learn More

MIT OpenCourseWare, Introduction to Probability and Statistics: search MIT OpenCourseWare for probability, random graphs, and statistical inference notes.
Network Science by Albert-László Barabási: a free online textbook with chapters on networks, community structure, and real-world graph patterns.
arXiv: search for review articles on stochastic block models, degree correction, and community detection thresholds.
PubMed: not the main source for this topic, but useful for finding network analysis papers in biology and medicine if you want a cross-domain comparison.
Probability Surveys: search the journal for review articles on random graphs and community detection thresholds.
NSF-supported network science resources: search university network science centers for free lecture notes and project examples.

Mathematics Category Guide

How to Do Real Mathematics Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →