Schema Inference and PII Masking for Open Data

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Databases · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Messy public data can hide names, addresses, and other private details in plain sight. One bad export can turn a useful dataset into a privacy mess. Your project can build a tool that reads chaos, finds structure, and masks sensitive fields before anyone queries them. That makes this topic useful, technical, and very testable.

What Is It?

This project asks you to build a system that looks at messy CSV and JSON files and guesses what the data means. Think of it like a smart librarian who can sort a box of mixed papers. The system tries to figure out column names, data types, and which fields may contain private information like phone numbers, emails, or addresses.
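To make the "guessing what the data means" step concrete, here is a minimal sketch of value-based type inference for a single column. It is one possible starting point, not a finished method: the function names, the `NULL_TOKENS` set, and the two date layouts are assumptions chosen for illustration, and real open-data files will need more cases.

```python
import re
from datetime import datetime

# Assumed markers for missing values; extend this set for your own files.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "-"}

# Assumed date layouts; many more appear in real exports.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def _is_date(text):
    """Return True if text parses under one of the assumed date layouts."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_type(values):
    """Guess a Postgres type name for a list of raw string values."""
    cleaned = [v.strip() for v in values if v.strip().lower() not in NULL_TOKENS]
    if not cleaned:
        return "text"  # nothing to learn from, so fall back to text
    if all(re.fullmatch(r"-?\d+", v) for v in cleaned):
        # Leading zeros usually signal an identifier (ZIP code, ID), not a number.
        if any(len(v) > 1 and v.startswith("0") for v in cleaned):
            return "text"
        return "bigint"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in cleaned):
        return "numeric"
    if all(_is_date(v) for v in cleaned):
        return "date"
    if all(v.lower() in {"true", "false", "yes", "no"} for v in cleaned):
        return "boolean"
    return "text"
```

Calling `infer_type` on a column of ZIP codes returns `text` rather than `bigint`, which is the kind of decision you can later score against your hand-made labels.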

Then it turns that guess into Postgres DDL. Postgres (short for PostgreSQL) is a database system, and DDL means data definition language, the commands that create tables and columns. You also add a masked view layer: a set of database views that expose a safer version of the data by hiding or scrambling sensitive fields. The goal is not just to read data, but to make it safer and easier to use.
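As a rough illustration of the DDL and masked-view step, the sketch below builds a `CREATE TABLE` statement and a matching `CREATE VIEW` statement as strings. The helper names, the `_masked` suffix, and the `MASK_RULES` table are all hypothetical. The masking expressions use the built-in Postgres functions `md5` and `right` purely as placeholders; an unsalted hash of an email is not a vetted anonymization method.

```python
def quote_ident(name):
    """Double-quote an identifier and escape any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# Hypothetical masking rules: PII label -> SQL expression template.
MASK_RULES = {
    "email": "md5({col})",
    "phone": "'***-***-' || right({col}, 4)",
    "name": "'REDACTED'",
}


def build_ddl(table, columns):
    """Build (create_table_sql, create_view_sql).

    columns is a list of (name, pg_type, pii_label) tuples, where pii_label
    is None for non-sensitive columns. pg_type is assumed to come from a
    fixed set produced by your own inference code, so it is not escaped.
    """
    col_defs = ",\n  ".join(f"{quote_ident(n)} {t}" for n, t, _ in columns)
    create_table = f"CREATE TABLE {quote_ident(table)} (\n  {col_defs}\n);"

    select_items = []
    for name, _, pii in columns:
        col = quote_ident(name)
        if pii is None:
            select_items.append(col)
        else:
            # Unknown PII labels are hidden entirely rather than passed through.
            template = MASK_RULES.get(pii, "NULL::text")
            select_items.append(f"{template.format(col=col)} AS {col}")
    select_list = ",\n  ".join(select_items)

    view = quote_ident(table + "_masked")
    create_view = (
        f"CREATE VIEW {view} AS\nSELECT\n  {select_list}\nFROM {quote_ident(table)};"
    )
    return create_table, create_view
```

Keeping the raw table and the masked view as separate objects lets you grant analysts access to the view only, while your evaluation code can still query both and compare results.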

Why This Is a Good Topic

This is a strong science fair topic because you can test it with real files and clear scoring rules. You can measure schema accuracy, PII detection accuracy, and how well your masked views preserve useful data. It connects to public data, privacy, and database automation, so the real-world value is easy to explain. You can also compare your method against simple baselines, which gives you a real research angle.

Research Questions

How does tree-sitter parsing affect schema recovery accuracy on messy JSON compared with simple regex rules?
What is the effect of tiny LM tagging on PII detection precision and recall in open data files?
Does adding column-name context improve type inference for mixed CSV and JSON records?
To what extent does a masked-view layer preserve query usefulness after PII fields are hidden?
Which features (text patterns, structural cues, or value samples) matter most for inferring table schemas from noisy files?
How does system performance change when files contain missing values, nested objects, or inconsistent keys?
What is the effect of different anonymization rules on the balance between privacy protection and downstream analysis quality?
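Several of these questions depend on how you handle nested objects and inconsistent keys. The sketch below shows one simple way to flatten JSON records into dotted column paths and measure how often each path is actually filled. The function names and the dotted-path convention are assumptions for illustration, and this version drops empty nested objects and stores lists as JSON text, which you may want to change.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted paths; lists are kept as JSON text."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat


def key_coverage(records):
    """Return {path: share of records where the path is present and not None}."""
    rows = [flatten(r) for r in records]
    if not rows:
        return {}
    paths = sorted({p for row in rows for p in row})
    return {
        p: sum(1 for row in rows if row.get(p) is not None) / len(rows)
        for p in paths
    }


records = [
    {"id": 1, "owner": {"name": "A", "phone": None}},
    {"id": 2, "Owner": {"name": "B"}},  # inconsistent key casing
    {"id": 3},                          # missing nested object
]
coverage = key_coverage(records)
```

In this toy input, `owner.name` and `Owner.name` show up as two separate low-coverage paths, which is exactly the kind of inconsistency your system has to decide whether to merge.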

Basic Materials

Laptop or desktop computer with Python support.
Sample messy CSV and JSON files from data.gov.
Text editor or code editor such as VS Code.
Python libraries for parsing and data handling, such as pandas and the standard-library json module.
Local PostgreSQL installation or a free hosted PostgreSQL instance.
Spreadsheet software for manual error checking and comparison.
A small labeled dataset of columns for schema and PII testing.

Advanced Materials

Linux workstation or server with enough memory to process many files.
PostgreSQL with a testing extension such as pgTAP; views themselves are built into PostgreSQL and need no extension.
Tree-sitter parsing libraries and language grammars for JSON and related text formats.
Open-source small language model or embedding model for tagging columns.
Annotation tool for creating gold labels for schema and PII classes.
Docker or a virtual environment for repeatable runs.
Statistical testing tools for comparing model versions and baselines.

Software & Tools

Python: Runs your parsing, tagging, evaluation, and database automation code.
PostgreSQL: Stores the inferred tables and builds masked views for testing.
pandas: Helps you inspect messy tabular data and compare inferred columns to labels.
tree-sitter: Parses structured text so your system can recover nested fields and syntax cues.
scikit-learn: Scores classifiers and compares baseline models for schema and PII detection.

Experiment Steps

Define one file family to target first, such as CSV exports, nested JSON, or mixed government datasets.
Decide which outputs you will score, such as column type, table shape, and PII label.
Build a baseline that uses simple rules so you have something fair to compare against.
Add structural parsing and text tagging, then plan how each signal will vote on the final schema.
Design a masked-view policy that hides sensitive fields without breaking common queries.
Set up evaluation metrics that separate schema accuracy, privacy accuracy, and downstream usefulness.
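The baseline and voting steps above can be sketched as a small rules-only classifier. This is a deliberately simple example, not a recommended detector: the regular expressions are US-centric guesses, and the names `VALUE_PATTERNS`, `NAME_HINTS`, and the `min_share` threshold are hypothetical choices you would tune and report.

```python
import re

# Assumed value patterns; both are intentionally loose.
VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
}

# Assumed substrings that hint at a PII class in a column name.
NAME_HINTS = {
    "email": ("email", "e_mail", "e-mail"),
    "phone": ("phone", "mobile"),
}


def value_vote(values, min_share=0.6):
    """Vote for a label if enough non-empty values fully match its pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for label, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= min_share:
            return label
    return None


def name_vote(column_name):
    """Vote for a label if the column name contains one of its hints."""
    lowered = column_name.lower()
    for label, hints in NAME_HINTS.items():
        if any(hint in lowered for hint in hints):
            return label
    return None


def classify_column(column_name, values):
    """Combine both signals: value evidence wins, the name is the fallback."""
    return value_vote(values) or name_vote(column_name)
```

Because each signal is a separate function, you can switch one off and rerun your evaluation, which gives you the ablation comparison without rewriting the pipeline.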

Common Pitfalls

Treating every string column as text, which lets dates, IDs, and codes load without errors even though their inferred type is wrong.
Testing only on clean files, which hides failure on nested JSON, inconsistent headers, and missing values.
Measuring PII detection with accuracy alone, which can look good even when rare sensitive fields are missed.
Building a mask that removes too much detail, which makes the database useless for analysis.
Comparing your model to a weak baseline, which makes the results hard to trust.
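The accuracy pitfall is easy to demonstrate with a few lines of arithmetic. The sketch below computes accuracy, precision, and recall by hand for made-up labels (100 columns, 5 of them truly sensitive); the numbers are invented for illustration and are not results from any real dataset.

```python
def scores(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = PII)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Define precision and recall as 0.0 when their denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented example: 5 sensitive columns among 100.
y_true = [1] * 5 + [0] * 95

# A detector that never flags anything still looks accurate.
never_flags = scores(y_true, [0] * 100)

# A detector that finds 4 of the 5, with 2 false alarms.
finds_most = scores(y_true, [1, 1, 1, 1, 0] + [1, 1] + [0] * 93)
```

The detector that never flags anything reaches 95% accuracy with zero recall, so it misses every sensitive column. In a full project you would likely use `precision_score` and `recall_score` from `sklearn.metrics` instead of hand-rolled counts.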

What Makes This Competitive

A competitive version of this project would compare several inference strategies on the same messy datasets and score them with careful metrics. You could separate schema recovery, PII detection, and query usefulness instead of reporting one vague score. A stronger entry would also test hard cases, such as nested fields, mixed formats, and rare sensitive data. If you add a clean ablation study, you can show which part of the system really helps.
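One way to score query usefulness separately is to check whether a masking rule preserves the answer to a simple grouped query. The sketch below does this in plain Python rather than in Postgres, so treat it as a stand-in for running the same `GROUP BY` against the raw table and the masked view. The function names and the salt value are hypothetical, and a salted hash of a guessable field can still be reversed, so this measures usefulness only, not privacy.

```python
import hashlib
from collections import Counter


def pseudonymize(value, salt="demo-salt"):
    """Deterministic pseudonym: equal inputs always map to equal outputs."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def redact(value):
    """Replace every value with the same constant."""
    return "REDACTED"


def group_sizes(rows, column):
    """Sorted group sizes for a GROUP BY on column, ignoring group labels."""
    return sorted(Counter(row[column] for row in rows).values())


def utility_preserved(rows, column, mask):
    """True if grouping by the masked column yields the same group sizes."""
    masked = [{**row, column: mask(row[column])} for row in rows]
    return group_sizes(rows, column) == group_sizes(masked, column)


rows = [
    {"email": "a@x.org", "fee": 10},
    {"email": "a@x.org", "fee": 5},
    {"email": "b@x.org", "fee": 7},
]
```

Here `utility_preserved(rows, "email", pseudonymize)` holds because repeat applicants still group together, while `utility_preserved(rows, "email", redact)` does not because all rows collapse into one group. Reporting this alongside your privacy metrics makes the trade-off visible.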

Project Variations

Focus only on CSV exports from government open data and compare schema inference on clean versus messy column headers.
Switch to healthcare-style JSON records and test whether structure-aware parsing improves privacy detection.
Replace the tiny LM tagger with a rules-only classifier, then measure how much performance you lose or gain.

Learn More

PostgreSQL Documentation: Read the official docs for CREATE TABLE, views, and data types at the PostgreSQL website.
PubMed: Search for review articles on de-identification, privacy-preserving data publishing, and structured health data.
NIH Data Sharing and Privacy resources: Find guidance on protecting sensitive data in research datasets at NIH pages.
data.gov: Browse open CSV and JSON datasets to build a realistic test set for schema inference.
USGS and NOAA data portals: Use their open datasets to practice on messy, real-world government files.
MIT OpenCourseWare, Database Systems: Find lecture notes and assignments on database design, schemas, and query processing.

