NYC Subway Delay Percolation Model
ISEF Category: Mathematics
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Probability and Statistics · Difficulty: Advanced · Setup: Home Setup · Time: Full Year
The Hook
A subway delay at one station can ripple through an entire line. That makes delays feel random, yet they follow patterns you can model with math and real-time data. If you do it well, your project can connect pure probability to a problem millions of riders know.
What Is It?
First-passage percolation is a math model for how something spreads through a network when each link has a random travel time. Picture a city map where every track segment has a different delay. The question is not just how long one trip takes. It is how delay patterns grow across the whole system.
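As a rough illustration of the idea, the sketch below computes first-passage times on a tiny made-up line. The six-station layout, the 2-minute base time, and the exponential random delay are all assumptions chosen for the example, not properties of the real subway. The passage time from a source to a station is the smallest total travel time over all paths, which Dijkstra's algorithm finds.

```python
import heapq
import random

def passage_times(graph, source):
    """Dijkstra's algorithm: smallest total travel time from source to each node."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph[node]:
            candidate = t + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best

def random_line(n_stations, rng):
    """Hypothetical toy line: stations 0..n-1, each link gets a random travel time."""
    graph = {i: [] for i in range(n_stations)}
    for i in range(n_stations - 1):
        w = 2.0 + rng.expovariate(1.0)  # assumed: 2 minutes base plus a random delay
        graph[i].append((i + 1, w))
        graph[i + 1].append((i, w))
    return graph

rng = random.Random(0)  # fixed seed so the run is repeatable
line = random_line(6, rng)
times = passage_times(line, 0)
for station in sorted(times):
    print(station, round(times[station], 2))
```

On a single line there is only one path between two stations, so the passage time is just a running sum. The model gets more interesting once transfers give trains or riders more than one route.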
For NYC subway data, you can treat stations or trip segments like a network and use GTFS-realtime feeds to record delays as they happen. GTFS-realtime is a public transit data format that many agencies use to report current vehicle positions and trip updates. Your model can ask whether delay times follow a normal (bell-curve) distribution, or whether they have heavy tails, which means rare big delays happen more often than a simple bell curve would predict.
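To make "heavy tails" concrete, here is a small sketch on synthetic numbers only. Neither sample is real subway data, and the Pareto tail index of 2.5 is an arbitrary assumption. It counts how often a value lands more than four standard deviations above the mean in a bell-curve sample and in a Pareto sample.

```python
import random
import statistics

def tail_fraction(values, k=4.0):
    """Fraction of values more than k standard deviations above the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    cutoff = mu + k * sigma
    return sum(v > cutoff for v in values) / len(values)

rng = random.Random(1)
n = 100_000
bell = [rng.gauss(5.0, 2.0) for _ in range(n)]      # synthetic bell-curve "delays"
heavy = [rng.paretovariate(2.5) for _ in range(n)]  # synthetic heavy-tailed "delays"

print("normal sample:", tail_fraction(bell))
print("pareto sample:", tail_fraction(heavy))
```

The heavy-tailed sample should show far more extreme values than the bell-curve sample. The same count on cleaned delay data is a quick first check before any formal fitting.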
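A limit shape describes how far the spread reaches after a long time. Mark every point that can be reached from a starting point within time t, then shrink that region by a factor of t. In classical first-passage percolation on a regular grid, under mild conditions on the random travel times, the shrunken region settles toward a fixed shape as t grows. Think of drops of dye spreading in water. One drop looks messy, but many drops can form a clear boundary. In this project, you are trying to see whether subway delays create a stable large-scale shape, and whether that shape changes when delays get extreme.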
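The scaling idea can be sketched on the textbook setting, a square grid with independent random edge weights. This is an assumption for illustration, since the real subway is not a grid, and the exponential weights with mean 1 are also assumed. The code computes the passage time T from the origin to the point (n, 0) and prints T / n for growing n. If a stable limit exists, those ratios should vary less as n grows.

```python
import heapq
import random

def grid_passage_times(radius, rng):
    """Passage times from the origin on a square grid with i.i.d. random edge weights."""
    weights = {}

    def weight(a, b):
        key = (a, b) if a < b else (b, a)  # one weight per undirected edge
        if key not in weights:
            weights[key] = rng.expovariate(1.0)  # assumed weight law, mean 1
        return weights[key]

    origin = (0, 0)
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        t, (x, y) = heapq.heappop(queue)
        if t > best[(x, y)]:
            continue  # stale queue entry
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if max(abs(nxt[0]), abs(nxt[1])) > radius:
                continue  # stay inside the finite window
            cand = t + weight((x, y), nxt)
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return best

rng = random.Random(2)
T = grid_passage_times(40, rng)
for n in (10, 20, 40):
    print(n, round(T[(n, 0)] / n, 3))
```

One run is a single random sample, so a real analysis would repeat this over many seeds and in several directions. It would also swap the exponential weights for a heavy-tailed law to see whether the ratios still settle.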
Why This Is a Good Topic
This is a strong science fair topic because you can test real, messy data with clear math questions. You do not need a wet lab. You need public data, careful cleaning, and solid statistical thinking. The topic connects to transit reliability, urban planning, and network science, so your results have real-world value. You can learn how to model randomness, compare distributions, and defend your choices with evidence.
Research Questions
- How does delay length vary across NYC subway lines during peak and off-peak hours?
- What is the effect of station location on the size of propagated delays in a line network?
- Does a power-law (Pareto-type) tail fit subway travel times better than a normal or lognormal model?
- To what extent do delay clusters form a stable limit shape across repeated service disruptions?
- Which network features, such as transfer hubs or terminal stations, predict the largest travel-time shocks?
- What is the effect of weather events on the tail behavior of GTFS-realtime delay data?
Basic Materials
- Laptop with internet access.
- Python installed with pandas, numpy, scipy, matplotlib, and statsmodels.
- Public GTFS-realtime access or archived transit feed data.
- Spreadsheet software for quick data checks.
- Notebook for tracking data-cleaning rules and model choices.
- Digital map of the subway network or station list.
- External storage for raw and cleaned datasets.
Advanced Materials
- University or cloud server access for larger data storage.
- Python with networkx, geopandas, and powerlaw.
- Access to archived GTFS-realtime feeds, if available.
- Statistical software for distribution fitting and hypothesis tests.
- GIS software for mapping delay propagation.
- Version control with Git for reproducible analysis.
- High-capacity storage for repeated feed snapshots.
Software & Tools
- Python: Cleans feed data, fits distributions, and runs simulation models.
- Jupyter Notebook: Keeps code, notes, and graphs in one place.
- pandas: Organizes GTFS-realtime records into analysis-ready tables.
- scipy: Fits probability models and tests distribution assumptions.
- matplotlib: Makes delay plots, histograms, and network visuals.
- NetworkX: Represents the subway as a graph for path and propagation analysis.
Experiment Steps
- Define the exact network question you want to test, such as one line, one corridor, or the full system.
- Choose the delay measure you will analyze, then write a clear rule for cleaning missing or duplicated records.
- Build a baseline distribution model and compare it with at least one heavy-tailed alternative.
- Map the subway as a network and decide which stations or segments count as sources, sinks, and transfer points.
- Design a method for checking whether delay growth follows a stable scaling pattern across events.
- Plan a validation step that tests your model on a different time period or a different line.
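The model-comparison step above could be sketched as follows. This is a minimal example on synthetic data, assuming three candidate families (normal, lognormal, Pareto) fitted by their closed-form maximum-likelihood estimates and ranked by AIC. Taking the smallest observation as the Pareto cutoff is a simplification, and a fuller analysis would choose the tail cutoff more carefully.

```python
import math
import random

def normal_loglik(x):
    """Maximized log-likelihood of a normal model."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def lognormal_loglik(x):
    """Maximized log-likelihood of a lognormal model (needs all x > 0)."""
    logs = [math.log(v) for v in x]
    return normal_loglik(logs) - sum(logs)

def pareto_loglik(x):
    """Maximized log-likelihood of a Pareto model with cutoff min(x)."""
    n = len(x)
    xmin = min(x)
    alpha = n / sum(math.log(v / xmin) for v in x)
    sum_logs = sum(math.log(v) for v in x)
    return n * math.log(alpha) + n * alpha * math.log(xmin) - (alpha + 1) * sum_logs

def aic_table(x):
    """AIC = 2k - 2 log L, with k = 2 fitted parameters for each family."""
    fits = {
        "normal": normal_loglik(x),
        "lognormal": lognormal_loglik(x),
        "pareto": pareto_loglik(x),
    }
    return {name: 2 * 2 - 2 * ll for name, ll in fits.items()}

rng = random.Random(3)
delays = [rng.paretovariate(2.0) for _ in range(2000)]  # synthetic, not real feed data
aic = aic_table(delays)
for name in sorted(aic, key=aic.get):
    print(name, round(aic[name], 1))
```

Lower AIC means a better trade-off between fit and complexity. For the validation step, the same functions can be run on a held-out time period or a different line to see whether the ranking holds.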
Common Pitfalls
- Mixing live feed snapshots with archived records, which creates fake delay jumps that are really data timing errors.
- Treating every outlier as noise, which can erase the heavy-tailed pattern you are trying to measure.
- Ignoring station transfers, which makes the network model miss where delays spread fastest.
- Comparing raw delay minutes across lines without normalization, which can hide structural differences in service frequency.
- Fitting one probability curve and stopping there, which leaves you without evidence that your heavy-tailed model is better.
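Two of the pitfalls above, repeated snapshots and unnormalized comparisons, can be guarded against with a small cleaning pass. The sketch below assumes each feed record has already been flattened into a dictionary. The field names (`trip_id`, `stop_id`, `route_id`, `snapshot_ts`, `delay_s`) and the headway values are hypothetical and will need to match however you parse your own data.

```python
def latest_per_stop(records):
    """Keep one record per (trip_id, stop_id): the one from the newest snapshot."""
    latest = {}
    for rec in records:
        key = (rec["trip_id"], rec["stop_id"])
        if key not in latest or rec["snapshot_ts"] > latest[key]["snapshot_ts"]:
            latest[key] = rec
    return list(latest.values())

def headway_normalized(records, headway_s):
    """Express each delay as a fraction of the route's scheduled headway."""
    return [
        {**rec, "delay_in_headways": rec["delay_s"] / headway_s[rec["route_id"]]}
        for rec in records
    ]

# Made-up example records: the first two describe the same trip and stop.
records = [
    {"trip_id": "A1", "stop_id": "S1", "route_id": "A", "snapshot_ts": 100, "delay_s": 60},
    {"trip_id": "A1", "stop_id": "S1", "route_id": "A", "snapshot_ts": 130, "delay_s": 90},
    {"trip_id": "G7", "stop_id": "S9", "route_id": "G", "snapshot_ts": 120, "delay_s": 90},
]
headway_s = {"A": 300, "G": 600}  # hypothetical scheduled headways in seconds

clean = headway_normalized(latest_per_stop(records), headway_s)
for rec in clean:
    print(rec["trip_id"], rec["delay_in_headways"])
```

Keeping the newest snapshot is one possible rule. Whatever rule you choose, write it down in your notebook so every dataset is cleaned the same way.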
What Makes This Competitive
A stronger project goes past simple charts and asks whether the subway network follows a specific mathematical law. You can stand out by comparing several distribution families, testing goodness of fit, and reporting uncertainty clearly. Strong projects also separate weekday, weekend, weather, and disruption effects instead of mixing them together. If your model predicts new delay behavior on unseen data, that is a big step up from a basic descriptive study.
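One way to report uncertainty is a bootstrap interval around a tail estimate. The sketch below uses the Hill estimator of the tail index on synthetic Pareto data with a true index of 2. The choice of k = 200 top observations and 200 resamples is arbitrary, and a simple percentile bootstrap is only a rough guide for tail estimators, so treat this as a starting point and not a finished method.

```python
import math
import random

def hill_tail_index(values, k):
    """Hill estimator of the tail index from the k largest observations."""
    top = sorted(values, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(v / threshold) for v in top[:k]) / k
    return 1.0 / mean_log_excess

def bootstrap_interval(values, k, n_boot, rng, level=0.95):
    """Percentile bootstrap interval for the Hill estimate."""
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(values, k=len(values))  # sample with replacement
        estimates.append(hill_tail_index(resample, k))
    estimates.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]

rng = random.Random(4)
delays = [rng.paretovariate(2.0) for _ in range(2000)]  # synthetic, true index 2
point = hill_tail_index(delays, k=200)
low, high = bootstrap_interval(delays, k=200, n_boot=200, rng=rng)
print("tail index estimate:", round(point, 2))
print("95% bootstrap interval:", round(low, 2), "to", round(high, 2))
```

Reporting the interval alongside the point estimate, and showing how both change with k, makes a claim about heavy tails much easier to defend.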
Project Variations
- Focus on one subway line and compare delay tails across different station pairs.
- Replace NYC with another city that publishes GTFS-realtime data and test whether the same model still fits.
- Analyze weather or event days separately to see whether extreme delays become more common.
Learn More
- GTFS-realtime documentation: Learn the transit feed format and field definitions from the official GTFS Realtime reference, originally developed by Google and now published at gtfs.org.
- NYC Open Data: Look for subway performance, service status, and transit-related datasets.
- MTA developer and data pages: Find official subway data feeds and documentation from the Metropolitan Transportation Authority.
- Google Scholar or arXiv: Search for review articles on transportation network modeling, travel-time reliability, and heavy-tailed delays.
- Probability and Random Processes: Use a standard university textbook or library copy for background on random variables, tail behavior, and limit theorems.
- MIT OpenCourseWare: Search for probability, statistics, and stochastic processes courses for free lecture notes and problem sets.
Mathematics Category Guide
How to Do Real Mathematics Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →
