Streaming Community Drift Detection in Social Graphs
ISEF Category: Systems Software
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Algorithms · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
Social networks change faster than most clustering methods can keep up. A group that looks stable one week can split, merge, or drift the next. Your project asks how to catch those changes without rerunning a huge algorithm every time. That is the core problem behind modern social graph analysis.
What Is It?
This phenomenon is about finding changes in communities inside a graph as new data arrives. A community is a group of nodes that connect to each other more often than to the rest of the network. In a social graph, that can mean people, subreddits, news sources, or accounts that behave like a cluster.
Think of it like watching a school lunch table map. If friends keep sitting together, the group stays stable. If some people switch tables, new clusters form, and old ones break apart. A streaming algorithm tries to notice those shifts while the graph keeps changing, instead of starting over from scratch each time.
Personalized PageRank is one way to measure how strongly a node belongs near a chosen seed node or group. A sketch is a compressed summary of the graph that saves space and speed. Your project can test whether a sketched personalized PageRank method detects drift fast enough, and whether it stays close to expensive full recomputation methods like Louvain and Leiden.
Why This Is a Good Topic
This is a strong science fair topic because you can turn a big, messy, real-world graph problem into a clear test. You can measure runtime, memory use, and how well your method matches known community changes. The topic connects to misinformation tracking, online behavior, and event detection in social media, so the real-world value is easy to explain. You can also learn algorithm design, graph theory, and experimental comparison in one project.
Research Questions
- How does sketch size affect drift detection accuracy in evolving social graphs?
- What is the effect of update rate on the runtime gap between streaming detection and full Louvain reruns?
- Does sketched personalized PageRank preserve community assignments as well as Leiden when new edges arrive?
- To what extent do different graph sources, such as GDELT and Reddit dumps, change the stability of detected communities?
- Which threshold settings best separate real community drift from random noise in the stream?
- How does seed choice affect precision and recall for detecting drifting communities?
Basic Materials
- Laptop with at least 16 GB RAM.
- Python.
- NetworkX or graph-tool.
- Jupyter Notebook or Google Colab.
- Public graph dataset from GDELT or Reddit dumps.
- CSV or JSON parser.
- External storage for large data files.
- Spreadsheet software for tracking runs and results.
Advanced Materials
- Workstation or server with high RAM and multi-core CPU.
- Python with NumPy, SciPy, pandas, NetworkX, and a graph clustering library.
- Implementation of personalized PageRank or a graph sketching method.
- Access to large-scale GDELT or Reddit data archives.
- Profiling tools for runtime and memory measurement.
- Visualization tools for graph evolution plots.
- Version control system for experiment tracking.
- Optional GPU only if your chosen analysis pipeline benefits from it.
Software & Tools
- Python: Lets you build the streaming algorithm, run experiments, and analyze results.
- NetworkX: Handles graph construction, basic graph metrics, and community-related preprocessing.
- pandas: Organizes event tables, edge lists, and experiment outputs.
- Jupyter Notebook: Keeps code, notes, and plots in one place while you test ideas.
- Gephi: Helps you inspect how communities shift over time with visual graph layouts.
Experiment Steps
- Define the drift signal you want to detect, such as community split, merge, or membership change.
- Choose one graph stream and one baseline method so your comparison stays focused.
- Design the sketch or approximation rule that will summarize the graph between updates.
- Decide how you will score accuracy, speed, memory use, and stability against reruns.
- Build a test plan that compares your streaming method with Louvain and Leiden on the same time slices.
- Check whether your method still works when you vary seed nodes, graph density, and update frequency.
Common Pitfalls
- Using a dataset that is too small or too static, which hides the difference between streaming detection and reruns.
- Comparing methods on different graph slices, which makes the accuracy results unfair.
- Treating community labels as fixed when Louvain and Leiden can relabel clusters between runs.
- Ignoring memory growth in the sketch, which can make a fast method fail on large streams.
- Measuring only runtime and not agreement with known drift, which leaves the main claim unproven.
What Makes This Competitive
A strong version of this project does more than say one method runs faster. It shows when the approximation fails, why it fails, and how that changes across different graph types. You can make the project stand out by testing several drift definitions, reporting both speed and error, and using statistics that compare methods across many time slices. A careful study of failure cases will usually be stronger than a single big chart.
Project Variations
- Test the same streaming drift method on a citation network instead of social media to see whether the graph type changes performance.
- Compare personalized PageRank sketches with a simpler degree-based drift detector to measure the value of the sketch.
- Focus on one platform, such as Reddit communities, and test whether topic changes or moderator events create detectable drift.
Learn More
- MIT OpenCourseWare, Introduction to Algorithms: Search the course materials for graph algorithms, streaming ideas, and analysis of approximate methods.
- Network Science by Albert-László Barabási: Look for the free online text to study community structure and evolving networks.
- PubMed: Search for review articles on graph mining in social and biomedical networks to see how researchers frame dynamic network analysis.
- arXiv: Search for preprints on dynamic community detection, personalized PageRank, and streaming graph algorithms.
- GDELT Project: Use the public documentation and event data pages to find large-scale news graph data.
- Reddit data archives: Search public Reddit dump documentation and community datasets for time-stamped discussion graphs.
Systems Software Category Guide
How to Do Real Systems Software Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →
