Private Streaming Quantile Sketches for Big Data

Private Streaming Quantile Sketches for Big Data

ISEF Category: Systems Software

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Algorithms  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

Every ride app keeps a secret about you, even when it only stores summaries. A good sketch can answer fast questions about huge data streams without keeping every record. Add privacy, and the design gets harder. You are trying to keep the useful answer while hiding any one rider's trip.

What Is It?

A streaming quantile sketch is a compact data structure that estimates percentiles, like the median or the 90th percentile, while data keeps arriving. Think of it like a tiny notebook that cannot store every taxi trip, but still remembers enough to answer useful questions about trip duration.

Differential privacy adds a noise budget that limits how much any single rider can influence the result. That protects people in the data, but noise can also blur the answer. Your project studies the tradeoff between three things: error, privacy, and space. The goal is to design a sketch that stays small, keeps private data private, and still gives a good estimate of the quantile you want.

Why This Is a Good Topic

This topic works well because you can measure it clearly. You can compare your sketch against a standard quantile method and track error, memory use, and privacy settings. It connects to a real problem, which is how to analyze live data streams without exposing people in the records. You can also learn data structures, privacy concepts, and performance testing in one project.

Research Questions

  • How does the privacy parameter affect quantile error in a streaming sketch?
  • What is the effect of sketch size on median and percentile accuracy?
  • Does adding differential privacy change error more for extreme quantiles than for the median?
  • To what extent does the sketch stay accurate as the data stream grows longer?
  • Which taxi trip-duration subsets produce the largest privacy-accuracy tradeoff?
  • How does the sketch compare with a nonprivate baseline under the same memory limit?

Basic Materials

  • Laptop or desktop computer with Python installed.
  • Public taxi trip-duration dataset from a government open data portal or a university data repository.
  • Python libraries for data analysis, such as NumPy and pandas.
  • Plotting library, such as Matplotlib or Seaborn.
  • Random number generator support in Python.
  • Spreadsheet software for tracking trial results.
  • Notebook for recording parameter choices and observations.

Advanced Materials

  • Workstation with enough memory to run repeated simulations.
  • Public taxi trip-duration stream or another time-stamped stream dataset.
  • Python implementation of streaming quantile sketches.
  • Differential privacy library or a custom noise mechanism in Python.
  • Profiling tools for memory and runtime measurement.
  • Statistical testing tools in Python or R.
  • Version control software for experiment tracking.

Software & Tools

  • Python: Runs simulations, computes quantile estimates, and measures error across trials.
  • Jupyter Notebook: Helps you document code, plots, and experiment changes in one place.
  • pandas: Loads and cleans large public stream datasets.
  • NumPy: Handles random sampling, array math, and repeated trials.
  • Matplotlib: Plots error, privacy settings, and memory use so you can compare methods.

Experiment Steps

  1. Define the exact quantile problem you want to solve, such as a median or a high percentile of trip duration.
  2. Choose one privacy mechanism and one baseline sketch so you can compare private and nonprivate performance.
  3. Decide which variables you will change first, such as privacy budget, sketch size, or stream length.
  4. Build a test plan that measures accuracy, memory use, and runtime on the same data stream.
  5. Create a comparison method that separates normal estimation error from privacy noise.
  6. Plan stress tests on different slices of the stream, such as short trips, long trips, or rush-hour data.

Common Pitfalls

  • Mixing up privacy noise with sketch approximation error, which makes your results impossible to interpret.
  • Testing only one quantile, which hides whether the method fails on tails like the 90th or 95th percentile.
  • Using a dataset that is already heavily cleaned, which can make the privacy setting look better than it really is.
  • Comparing methods with different memory budgets, which turns the study into an unfair benchmark.
  • Ignoring repeated trials, which leaves you with noisy results that may look better or worse by chance.

What Makes This Competitive

A strong project will do more than run one algorithm once. You need careful baselines, repeated trials, and a clear way to separate privacy loss from approximation loss. You will look stronger if you test several quantiles, several data slices, and several memory limits. A top project also explains why the tradeoff changes, not just which method wins.

Project Variations

  • Test the sketch on taxi pickup times instead of trip duration to see whether different distributions change the privacy tradeoff.
  • Compare the same private sketch on two public streams, such as taxi rides and bike-share trips, to test how data shape affects error.
  • Evaluate how the method performs on sliding windows instead of the full stream to model a live monitoring task.

Learn More

  • MIT OpenCourseWare, Introduction to Algorithms: Search for lectures on streaming algorithms, data structures, and order statistics.
  • NIST Differential Privacy resources: Read government guides on privacy definitions and testing methods.
  • US Census Bureau Differential Privacy materials: Review public explanations of privacy mechanisms and released data.
  • arXiv: Search for recent papers on differentially private quantile estimation and streaming sketches.
  • PubMed: Search for review articles on data privacy in health and mobility data, then compare the methods used.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

To discover more projects, visit the MehtA+ Science Fair Hub →

Shopping Cart