Mining API Specs From Web Traffic

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Web APIs often act like black boxes. You send requests, you get responses, and the rules stay hidden. A tool that watches traffic and writes the rules for you can save hours of manual work. Your project asks how close that tool can get to the truth.

What Is It?

This topic is about reverse engineering an API from the messages it sends and receives. An API, or application programming interface, is the set of rules a program uses to talk to another program. If you watch enough HTTP traffic, you can guess the endpoints, the request fields, the response shapes, and some patterns that should always hold.

Think of it like learning the rules of a game by watching people play. You do not see the rulebook at first, but you can still infer that certain moves are allowed and others are not. In this project, the tool tries to write an OpenAPI spec, which is a standard format for describing API routes, inputs, and outputs. It also tries to find invariants, meaning rules that always hold, and to turn them into property-based tests, which check those rules across many generated cases.
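As a rough illustration of the endpoint-guessing step, the sketch below collapses observed URLs into path templates. The host `api.example.test`, the function names, and the rule that only integers and UUIDs count as identifiers are all assumptions made for this example; a real tool would need richer heuristics.

```python
import re
from urllib.parse import urlsplit

# Heuristic patterns for path segments that look like identifiers (an assumption).
ID_PATTERNS = [
    re.compile(r"^\d+$"),  # integers such as 42
    re.compile(  # UUIDs in the usual 8-4-4-4-12 hex layout
        r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
    ),
]


def to_template(url: str) -> str:
    """Replace identifier-like path segments with a {id} placeholder."""
    path = urlsplit(url).path  # drops scheme, host, and query string
    segments = []
    for seg in path.split("/"):
        if any(p.match(seg) for p in ID_PATTERNS):
            segments.append("{id}")
        else:
            segments.append(seg)
    return "/".join(segments)


def infer_endpoints(observed):
    """Collapse (method, url) pairs into a set of (method, path template) pairs."""
    return {(method.upper(), to_template(url)) for method, url in observed}


# Hypothetical captured traffic.
observed = [
    ("GET", "https://api.example.test/users/17"),
    ("GET", "https://api.example.test/users/42?verbose=1"),
    ("post", "https://api.example.test/users"),
    ("GET", "https://api.example.test/users/42/orders/9"),
]
endpoints = infer_endpoints(observed)
for method, template in sorted(endpoints):
    print(method, template)
```

Every placeholder is named `{id}` here, so the sketch recovers the shape of a path but not the parameter names a published spec would use. Your scoring rules need to say whether that counts as a match.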

Why This Is a Good Topic

This is a strong science fair topic because you can measure how well the tool reconstructs a real system. You can compare its inferred spec against known public API specs, then score precision, recall, and test coverage. The problem connects to software maintenance, security, and automated testing, which are real needs in industry. Along the way you practice traffic parsing, schema comparison, and evaluation design.
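Precision and recall can be computed directly once both specs are reduced to sets of comparable items. The sketch below assumes each endpoint is represented as a (method, path template) pair; the endpoint sets are invented for illustration.

```python
def precision_recall(inferred: set, truth: set) -> tuple[float, float]:
    """Precision: share of inferred items that are real.
    Recall: share of real items that were inferred."""
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth taken from a published spec.
truth = {
    ("GET", "/users"),
    ("POST", "/users"),
    ("GET", "/users/{id}"),
    ("DELETE", "/users/{id}"),
}
# Hypothetical output of the inference tool.
inferred = {
    ("GET", "/users"),
    ("GET", "/users/{id}"),
    ("GET", "/users/{id}/avatar"),  # not in the published spec
}

p, r = precision_recall(inferred, truth)
f1 = 2 * p * r / (p + r) if p + r else 0.0
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The same function works for parameters or schema fields if you encode each one as a hashable item, for example (endpoint, field name, type).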

Research Questions

How does the amount of observed HTTP traffic affect the accuracy of the inferred OpenAPI spec?
What is the effect of API complexity on endpoint and parameter recovery?
Does filtering noisy or duplicate requests improve invariant discovery?
To what extent does traffic from a single client versus multiple clients change the quality of inferred property-based tests?
Which parts of the spec, such as path patterns, request bodies, or response schemas, are recovered most accurately?
How does the tool perform on APIs with stable schemas compared with APIs that return highly variable data?
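For the question about duplicate requests, one possible filter keeps a single request per "shape". The sketch below assumes a shape is the method, the path, the set of query parameter names, and the set of top-level JSON body fields. That definition, the function names, and the sample requests are assumptions for this example.

```python
import json
from urllib.parse import parse_qsl, urlsplit


def request_key(method, url, body=None):
    """Build a hashable 'shape' for a request, ignoring parameter values."""
    parts = urlsplit(url)
    query_names = tuple(
        sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    )
    try:
        parsed = json.loads(body) if body else None
    except json.JSONDecodeError:
        parsed = None  # non-JSON bodies contribute no field names
    body_fields = tuple(sorted(parsed)) if isinstance(parsed, dict) else ()
    return (method.upper(), parts.path, query_names, body_fields)


def deduplicate(requests):
    """Keep the first request seen for each shape, preserving order."""
    seen = set()
    kept = []
    for req in requests:
        key = request_key(*req)
        if key not in seen:
            seen.add(key)
            kept.append(req)
    return kept


# Hypothetical capture: (method, url, body) tuples.
requests = [
    ("GET", "https://api.example.test/items?page=1&limit=10", None),
    ("GET", "https://api.example.test/items?limit=50&page=2", None),
    ("POST", "https://api.example.test/items", '{"name": "a", "price": 1}'),
    ("POST", "https://api.example.test/items", '{"price": 2, "name": "b"}'),
    ("POST", "https://api.example.test/items", '{"name": "c"}'),
]
kept = deduplicate(requests)
print(len(requests), "->", len(kept))
```

Dropping same-shape requests also throws away value variety, which could hurt invariant discovery even if it helps schema recovery. Running the pipeline with and without this filter is one way to test that trade-off.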

Basic Materials

Computer with a modern operating system and at least 16 GB RAM.
Network capture or proxy tool such as mitmproxy.
Access to several public REST APIs with published documentation.
A local scripting environment such as Python.
Spreadsheet software or a notebook for tracking evaluation scores.
JSON validator or formatter.
Git for version control.

Advanced Materials

Server or cloud VM for running repeated capture jobs.
HTTP replay or load-generation tool for controlled traffic collection.
OpenAPI parser or schema comparison library.
Property-based testing framework such as Hypothesis.
Packet capture tools such as Wireshark.
Container platform such as Docker for reproducing API clients and test runs.
Statistical analysis environment such as Python with SciPy or R.

Software & Tools

OpenAPI Specification: Defines the API format you will compare against when scoring inference quality.
mitmproxy: Captures and inspects HTTP traffic so you can collect requests and responses.
Python: Lets you parse logs, compare schemas, and automate evaluation.
Hypothesis: Generates many test inputs from strategies you define, so you can express each inferred rule as a property-based test and check it against the API.
Wireshark: Helps verify lower-level network details when traffic capture looks inconsistent.
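A capture step with mitmproxy could look like the addon sketch below, which appends one JSON line per request/response pair. The class name, the output file `traffic.jsonl`, and the record layout are assumptions for this example. The `response` hook and the module-level `addons` list follow mitmproxy's addon convention, and the class needs no mitmproxy import because mitmproxy calls hooks by name.

```python
import json


class TrafficRecorder:
    """Append one JSON line per completed request/response pair."""

    def __init__(self, out_path="traffic.jsonl"):
        self.out_path = out_path

    def response(self, flow):
        # mitmproxy calls this hook once a full response has been received.
        record = {
            "method": flow.request.method,
            "url": flow.request.pretty_url,
            # strict=False avoids an exception when a body cannot be decoded.
            "request_body": flow.request.get_text(strict=False),
            "status": flow.response.status_code,
            "response_body": flow.response.get_text(strict=False),
        }
        with open(self.out_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")


# mitmproxy looks for a module-level list named `addons` in the script.
addons = [TrafficRecorder()]
```

Saved as, say, `record_traffic.py`, the script can be loaded with `mitmdump -s record_traffic.py`. Before sharing logs, remove credentials and tokens from the recorded headers and bodies.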

Experiment Steps

Define which API features you want to recover first, such as routes, parameters, schemas, or invariants.
Choose a set of public APIs with known OpenAPI specs so you can compare predictions against ground truth.
Plan how you will collect traffic in a repeatable way, including client behavior, request variety, and logging format.
Decide on scoring metrics for each output, such as exact match, field overlap, or test pass rate.
Build controls that separate true inference gains from simple memorization of repeated requests.
Compare traffic volume, API complexity, and client diversity to see which factors change recovery quality most.
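The traffic-volume comparison in the steps above can be prototyped by subsampling one large capture at several sizes and averaging recall over repeated draws. In the sketch below the traffic mix, the sample sizes, and the shortcut of treating each observation as an already-templated endpoint are assumptions for this example.

```python
import random


def endpoint_recall(sample, truth):
    """Fraction of ground-truth endpoints that appear in the sample."""
    found = set(sample) & truth
    return len(found) / len(truth)


def learning_curve(traffic, truth, sizes, trials=20, seed=0):
    """Mean recall at each sample size, averaged over random subsamples."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    curve = {}
    for n in sizes:
        scores = [
            endpoint_recall(rng.sample(traffic, n), truth) for _ in range(trials)
        ]
        curve[n] = sum(scores) / trials
    return curve


# Hypothetical capture with a skewed mix: rare endpoints need more traffic.
traffic = (
    [("GET", "/items")] * 60
    + [("GET", "/items/{id}")] * 25
    + [("POST", "/items")] * 10
    + [("DELETE", "/items/{id}")] * 5
)
truth = set(traffic)

curve = learning_curve(traffic, truth, sizes=[5, 20, 100])
for n, score in curve.items():
    print(f"{n:>4} requests: mean recall {score:.2f}")
```

Averaging over several random draws at each size separates the effect of volume from the luck of which requests happened to be captured.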

Common Pitfalls

Using only one short capture session, which leaves the inferred spec too incomplete to evaluate fairly.
Mixing requests from different API versions, which makes the inferred spec disagree with any single ground truth spec.
Treating optional fields as required, which inflates false positives in the recovered schema.
Ignoring authentication or rate limits, which causes replay tests to fail for reasons unrelated to inference quality.
Scoring only endpoint names and skipping response structure, which misses the main value of spec mining.
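The optional-versus-required pitfall is easier to see with a small schema-merging sketch. It marks a field as required only if it appears in every observed sample. The function names, the sample responses, and the JSON-Schema-like output layout are assumptions for this example.

```python
def type_name(value):
    """Map a parsed JSON value to a JSON Schema type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_object_schema(samples):
    """Merge top-level fields of several JSON objects into one schema."""
    field_types = {}
    counts = {}
    for obj in samples:
        for key, value in obj.items():
            field_types.setdefault(key, set()).add(type_name(value))
            counts[key] = counts.get(key, 0) + 1
    # A field is a candidate for "required" only if every sample had it.
    required = sorted(k for k, c in counts.items() if c == len(samples))
    properties = {k: {"type": sorted(t)} for k, t in sorted(field_types.items())}
    return {"type": "object", "properties": properties, "required": required}


# Hypothetical response bodies for one endpoint.
samples = [
    {"id": 1, "name": "a", "email": "x@example.test"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "email": None},
]
schema = infer_object_schema(samples)
print(schema["required"])
print(schema["properties"]["email"])
```

A field that appears in every sample may still be optional in the real spec, so the `required` list is a hypothesis that shrinks as more traffic arrives. Reporting the number of samples behind each claim makes that uncertainty visible.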

What Makes This Competitive

A strong version of this project goes past a simple demo. You compare against a real ground truth spec, define clear metrics, and test more than one class of API. You also look at failure cases, not just average scores, so you can explain where the tool breaks and why. If you add a new evaluation angle, such as invariants that catch hidden schema rules, the project gets much stronger.
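One way to picture invariant discovery is to propose simple candidate rules from training responses and then look for counterexamples in held-out responses. The sketch below mines only ordering rules of the form "field a <= field b". It is a hand-rolled stand-in, not Hypothesis or any existing tool, and the field names and records are invented for this example.

```python
from itertools import permutations


def numeric_fields(record):
    """Top-level fields holding numbers (bools excluded)."""
    return {
        k: v
        for k, v in record.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }


def mine_ordering_invariants(records):
    """Return (a, b) pairs such that record[a] <= record[b] in every record."""
    common = set.intersection(*(set(numeric_fields(r)) for r in records))
    return {
        (a, b)
        for a, b in permutations(sorted(common), 2)
        if all(r[a] <= r[b] for r in records)
    }


def find_violations(invariants, records):
    """List (a, b, record) triples where a mined rule fails."""
    return [
        (a, b, r)
        for r in records
        for a, b in sorted(invariants)
        if a in r and b in r and not r[a] <= r[b]
    ]


# Hypothetical paginated responses used for mining.
training = [
    {"offset": 0, "returned": 10, "total": 35},
    {"offset": 10, "returned": 10, "total": 35},
    {"offset": 30, "returned": 5, "total": 35},
]
# Hypothetical held-out responses used for checking.
held_out = [
    {"offset": 40, "returned": 0, "total": 35},
    {"offset": 0, "returned": 35, "total": 35},
]

invariants = mine_ordering_invariants(training)
violations = find_violations(invariants, held_out)
print(sorted(invariants))
print(len(violations), "violation(s) on held-out data")
```

A violation can mean the mined rule was wrong or that the API misbehaved, and telling those apart is the kind of failure analysis that strengthens the project. A framework such as Hypothesis could replace the fixed held-out set by generating request inputs.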

Project Variations

Try the same inference pipeline on REST APIs with different authentication styles, then compare how access control changes recovery quality.
Swap in GraphQL or gRPC traffic and measure how much harder the spec-mining problem becomes.
Focus only on invariant discovery, then test whether those invariants improve downstream property-based test generation.

Learn More

OpenAPI Specification: Read the official spec and examples on the OpenAPI Initiative website.
MIT OpenCourseWare: Search for software engineering or testing courses that cover API design, data formats, and program analysis.
PubMed: Indexes mainly biomedical literature, so expect limited coverage of automated software testing; treat it as a supplement to the engineering sources below.
IEEE Xplore: Search for research on API mining, software reverse engineering, and automated test generation.
arXiv: Search for recent preprints on specification mining, program synthesis, and API inference.
