Privacy-Preserving LLM Training Set Auditor

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Your old post can stay online for years, even after you delete it. Large language models can also absorb text from public web data, which makes that post harder to track. You can build a tool that checks whether a post likely landed in a training set without sending the post to a server. That turns a privacy problem into a measurable software question.

What Is It?

This project asks a simple question with a hard answer: did a public post get pulled into a large language model training set? Training sets are giant text collections used to teach models patterns in language. A membership-inference test tries to tell whether one specific text was part of the training data. An n-gram lookup breaks the text into short word sequences, then checks whether those sequences appear in known datasets.
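To make the n-gram lookup concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and a tiny in-memory corpus; the function names `word_ngrams` and `overlap_fraction` and the sample sentences are illustrative placeholders, not part of any real dataset or library.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(post: str, corpus_ngrams: set, n: int = 5) -> float:
    """Fraction of the post's distinct n-grams that also occur in the corpus index."""
    grams = word_ngrams(post, n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


# Toy corpus standing in for a real dataset excerpt (illustrative only).
corpus_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
corpus_index: set = set()
for doc in corpus_docs:
    corpus_index |= word_ngrams(doc, n=3)

copied = "quick brown fox jumps over the lazy dog"
unrelated = "my cat sleeps on the warm windowsill all day"

print(overlap_fraction(copied, corpus_index, n=3))     # every trigram is in the index
print(overlap_fraction(unrelated, corpus_index, n=3))  # no trigram is in the index
```

A real auditor would likely need better tokenization (punctuation, Unicode normalization) and a much larger index, but the core check stays the same: split, look up, count.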

Think of it like fingerprint matching. One fingerprint might not prove anything. A cluster of matches can create a stronger signal. Your auditor can combine membership inference and n-gram lookup, then keep the whole check inside the user’s browser or local app. That way, the text never has to leave the device.
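One possible way to combine the two signals is a weighted average. The sketch below is only an assumption about how you might do it: it supposes you can get per-token log-probabilities for the post from a model running locally, and that you have measured a typical loss on text known to be outside the training set. The names `loss_signal`, `combined_score`, `reference_loss`, and `weight` are all hypothetical.

```python
import math


def loss_signal(token_logprobs: list[float], reference_loss: float) -> float:
    """Map a model's average loss on a post to a 0..1 membership signal.

    token_logprobs: natural-log probabilities a local model gave each token
                    (assumed input; how you obtain them depends on your model).
    reference_loss: average loss measured on known-excluded text (assumed).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_loss = -sum(token_logprobs) / len(token_logprobs)
    # Logistic squash: a loss below the reference gives a signal above 0.5.
    return 1.0 / (1.0 + math.exp(avg_loss - reference_loss))


def combined_score(ngram_overlap: float, mi_signal: float, weight: float = 0.5) -> float:
    """Weighted average of the n-gram overlap and the membership signal."""
    return weight * ngram_overlap + (1.0 - weight) * mi_signal


# Example with made-up numbers: the model finds the post easier than the reference.
signal = loss_signal([-1.0, -1.0, -1.0], reference_loss=2.0)
print(combined_score(ngram_overlap=0.8, mi_signal=signal))
```

The weight is something you would tune on labeled data, and a simple average is only one scoring rule among several worth comparing.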

Why This Is a Good Topic

This is a strong science fair topic because you can test a clear input and output, and then measure how well your auditor detects known training-set text versus text that was never included. It connects to privacy, AI accountability, and data governance, which are real problems with public stakes. You can also study accuracy, false positives, and performance tradeoffs, which makes the project more than just a demo.

Research Questions

How does n-gram length affect the accuracy of detecting whether a post appears in a known LLM training set?
What is the effect of combining membership inference with n-gram lookup on false positive rates?
Does the auditor identify copied text more accurately than paraphrased text?
To what extent does post length change detection success for public web text?
Which type of scoring rule best separates included posts from excluded posts?
How does client-side processing affect runtime as dataset size increases?

Basic Materials

Laptop or desktop computer with a modern browser.
Python or JavaScript environment for text processing.
Public training-set samples or accessible dataset excerpts from The Pile, RedPajama, or Dolma documentation.
A set of user-authored test posts and matched control posts.
Text files or CSV files for storing excerpts and labels.
Spreadsheet software for tracking results.
Basic plotting tool for accuracy, precision, and recall graphs.

Advanced Materials

University workstation or powerful laptop for local indexing.
Larger text corpora from public dataset mirrors or research-access samples.
A local database or search index such as SQLite or Elasticsearch for n-gram retrieval, or FAISS for embedding-based similarity search.
Python libraries for tokenization, similarity scoring, and evaluation.
A privacy-preserving browser extension or local app framework for interface testing.
Statistical analysis software for ROC curves, calibration plots, and significance testing.
Version control system for reproducible experiments.

Software & Tools

Python: Processes text, builds n-gram features, and scores membership signals.
Jupyter Notebook: Helps you explore features, test thresholds, and graph results.
ImageJ: Not needed for this topic, so skip it and keep the project text-focused.
SQLite: Stores local lookup tables and lets you test client-side search logic.
R: Runs statistical tests and builds clear comparison plots for model performance.
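As a rough illustration of the SQLite lookup table, the sketch below uses Python's built-in `sqlite3` and `hashlib` modules with an in-memory database. The table name `ngrams`, the truncated SHA-1 keys, and the helper names are assumptions chosen for this example; truncating hashes keeps the table small but allows rare collisions.

```python
import hashlib
import sqlite3


def ngram_hashes(text: str, n: int = 5):
    """Yield a short hex hash for each word n-gram in text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        # Truncated SHA-1 keeps keys compact; collisions are possible but rare.
        yield hashlib.sha1(gram.encode("utf-8")).hexdigest()[:16]


def index_document(conn: sqlite3.Connection, text: str, n: int = 5) -> None:
    """Add every n-gram hash of a corpus document to the lookup table."""
    conn.executemany(
        "INSERT OR IGNORE INTO ngrams (hash) VALUES (?)",
        ((h,) for h in ngram_hashes(text, n)),
    )
    conn.commit()


def count_hits(conn: sqlite3.Connection, text: str, n: int = 5) -> tuple[int, int]:
    """Return (n-grams found in the table, total n-grams in the post)."""
    hashes = list(ngram_hashes(text, n))
    hits = 0
    for h in hashes:
        if conn.execute("SELECT 1 FROM ngrams WHERE hash = ?", (h,)).fetchone():
            hits += 1
    return hits, len(hashes)


# In-memory database for the demo; a real index would live in a local file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (hash TEXT PRIMARY KEY)")
index_document(conn, "the quick brown fox jumps over the lazy dog", n=3)

print(count_hits(conn, "brown fox jumps over a fence", n=3))
```

Because the database is a local file or in-memory object, the post being checked never leaves the machine, which matches the privacy goal of the project.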

Experiment Steps

Define your detection target, such as exact text matches, near matches, or paraphrases.
Build a local corpus of known-inclusion and known-exclusion samples, then label them carefully.
Design your feature pipeline, including n-grams, overlap scores, and any membership-inference metric.
Set up controls that separate true inclusion from simple topical similarity or copied phrases.
Choose evaluation metrics, such as precision, recall, false positive rate, and runtime on device.
Compare different thresholds and scoring rules, then pick the one that best balances missed detections against false positives while keeping all processing on the device.
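The evaluation and threshold-comparison steps above can be sketched as follows. The scores and labels are made-up toy values, and `metrics` is a hypothetical helper name; in a real run the scores would come from your auditor and the labels from your known-inclusion and known-exclusion samples.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold  # flag as "included" at or above threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(scores, labels, threshold):
    """Precision, recall, and false positive rate at one threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data: True means the post is known to be in the training set.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [True, True, True, False, False, False]

for threshold in (0.3, 0.5, 0.75):
    print(threshold, metrics(scores, labels, threshold))
```

Sweeping the threshold like this shows the tradeoff directly: a low threshold catches more included posts but flags more excluded ones, and a high threshold does the opposite.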

Common Pitfalls

Treating topical similarity as proof of inclusion, which inflates false positives.
Using only exact n-gram matches, which misses paraphrased or lightly edited posts.
Mixing training-set excerpts and test posts from the same source, which leaks labels into the evaluation.
Skipping a clean control group, which makes it hard to tell whether the auditor works better than chance.
Sending raw user text to a remote server during testing, which breaks the privacy goal of the project.
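One way to reduce the label-leakage pitfall is to split your labeled samples by source, so that every sample from a given website or author ends up on only one side of the tuning/evaluation split. The sketch below assumes each sample is a dictionary with a `source` field; that field name, the function name `split_by_source`, and the example data are hypothetical.

```python
import random
from collections import defaultdict


def split_by_source(samples, test_fraction=0.3, seed=0):
    """Split samples so that no source appears in both the tuning and test sets."""
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample["source"]].append(sample)

    sources = sorted(by_source)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(sources)
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])

    tune = [s for s in samples if s["source"] not in test_sources]
    test = [s for s in samples if s["source"] in test_sources]
    return tune, test


# Made-up samples: two per source, one labeled included and one excluded.
samples = [
    {"source": f"site{i}", "text": f"post {i}-{j}", "included": j == 0}
    for i in range(5)
    for j in range(2)
]

tune, test = split_by_source(samples)
print(len(tune), len(test))
```

Tune thresholds only on the first set and report results only on the second, so the numbers you present reflect sources the auditor has not been adjusted to.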

What Makes This Competitive

A stronger version of this project goes past a simple yes-or-no checker. You would test several inference signals, measure privacy tradeoffs, and report where the auditor fails. You could compare exact match, fuzzy match, and semantic similarity, then show which cases each method handles best. Clean evaluation and honest error analysis matter more than flashy code.

Project Variations

Test the auditor on social media posts versus blog-style writing to see whether post style changes detection rates.
Compare exact n-gram lookup with fuzzy token overlap to measure how much paraphrasing weakens detection.
Build a browser-only version and compare its speed, memory use, and accuracy with a local desktop version.
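For the exact-versus-fuzzy variation, a small comparison might look like the sketch below. It uses Jaccard similarity over word sets as one simple stand-in for "fuzzy token overlap"; other fuzzy measures exist, and the example sentences and function names are invented for illustration.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def exact_ngram_overlap(post: str, source: str, n: int = 4) -> float:
    """Fraction of the post's n-grams that appear verbatim in the source."""
    post_grams = word_ngrams(post, n)
    if not post_grams:
        return 0.0
    return len(post_grams & word_ngrams(source, n)) / len(post_grams)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets (ignores word order)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "the city council voted to close the old library on main street"
paraphrase = "the city council decided to shut the old library on main street"

exact = exact_ngram_overlap(paraphrase, original, n=4)
fuzzy = token_jaccard(paraphrase, original)
print(f"exact 4-gram overlap: {exact:.3f}, token Jaccard: {fuzzy:.3f}")
```

In this toy pair, swapping two words breaks most of the 4-grams while leaving the word sets largely shared, which is the kind of gap the variation asks you to measure. Keep in mind that fuzzy measures can also score unrelated posts on the same topic highly, so they need their own false-positive check.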

Learn More

PubMed: Search for review articles on membership inference attacks and privacy in machine learning, then use them to frame your evaluation methods.
arXiv: Search for recent preprints on training data extraction, memorization, and dataset auditing in large language models.
NIH Office of Data Science Strategy: Review public materials on data privacy, governance, and reproducible analysis in biomedical data systems.
MIT OpenCourseWare: Look for computer science courses on information retrieval, privacy, or machine learning systems to strengthen your technical background.
ACL Anthology: Search for peer-reviewed papers on language model memorization, dataset contamination, and extraction attacks.

Systems Software Category Guide

How to Do Real Systems Software Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
