Verifiable Natural Language to SQL Systems

ISEF Category: Systems Software

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Databases  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

LLMs can write SQL, but they can also be confidently wrong. If a query touches grades, money, or health data, a wrong answer can cause real damage. Your project asks a sharp question: can you prove a query is right instead of just guessing?

What Is It?

This project is about turning a plain-English question into SQL, then attaching a proof certificate that lets another system check the query. Think of SQL as the recipe, and the proof certificate as the grocery receipt plus a step-by-step cooking log. The certificate can include a logical plan, which is the structure of the query, and provenance, which tracks where each result came from.

That extra layer matters because LLMs can produce SQL that looks right but fails on edge cases. A verification layer gives you something closer to a safety net. You are not just asking, "Can the model write SQL?" You are asking, "Can the model explain why this SQL should be trusted, and can a checker confirm it without relying on the same LLM?"
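To make the idea of an independent checker concrete, here is a minimal sketch in Python using the standard sqlite3 module. It assumes a very simplified certificate: the generator only claims which (table, column) pairs its query reads, and the checker asks SQLite itself, through an authorizer callback, which columns the query really touches. The names ProofCertificate, observed_columns, check_certificate, and SCHEMA are hypothetical, and this schema-level check is only a stand-in for the richer logical-plan and row-level provenance a full project would explore.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical toy schema, used only for illustration.
SCHEMA = (
    "CREATE TABLE students (id INTEGER, name TEXT);"
    "CREATE TABLE grades (student_id INTEGER, score REAL);"
)


@dataclass(frozen=True)
class ProofCertificate:
    """Hypothetical certificate a generator ships alongside its SQL."""
    sql: str
    claimed_columns: frozenset  # {(table, column), ...} the generator says it reads


def observed_columns(schema_sql, sql):
    """Return every (table, column) pair SQLite reports reading for sql."""
    conn = sqlite3.connect(":memory:")  # fresh database for every check
    seen = set()

    def record_reads(action, arg1, arg2, db_name, source):
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ:
            seen.add((arg1, arg2))
        return sqlite3.SQLITE_OK

    try:
        conn.executescript(schema_sql)      # build empty tables first
        conn.set_authorizer(record_reads)   # then start recording reads
        conn.execute(sql).fetchall()
    finally:
        conn.close()
    return seen


def check_certificate(schema_sql, cert):
    """Accept only if the query compiles and reads nothing outside its claims."""
    try:
        return observed_columns(schema_sql, cert.sql) <= cert.claimed_columns
    except sqlite3.Error:
        return False  # e.g. the query references a table or column that does not exist
```

The checker never calls the LLM, so its verdict does not depend on the model that wrote the query. A certificate that leaves out a column the query really reads is rejected, and so is a query that does not compile against the schema.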

Why This Is a Good Topic

This is a strong science fair topic because you can test clear, measurable outcomes. You can compare plain LLM SQL against LLM SQL plus verification, then measure exact-match accuracy, execution accuracy, and how well your confidence score matches real correctness. The topic connects to databases, AI safety, and trustworthy software, which gives it real-world weight. A student can learn query parsing, evaluation design, error analysis, and basic provenance ideas without building a full production system.
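The difference between exact-match accuracy and execution accuracy is easy to see in a few lines of Python. The sketch below is a simplified illustration, not the official benchmark scorer: exact match is reduced to a whitespace- and case-insensitive string comparison, and execution match compares result rows as unordered multisets on a made-up toy table. All function names and the toy data are assumptions for this example.

```python
import sqlite3
from collections import Counter


def normalize(sql):
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    return " ".join(sql.lower().split()).rstrip("; ")


def exact_match(pred, gold):
    """Simplified exact match: identical text after normalization."""
    return normalize(pred) == normalize(gold)


def execution_match(conn, pred, gold):
    """True if both queries return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold_rows = conn.execute(gold).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


def toy_db():
    """Made-up data; note the row that sits exactly on the 90 boundary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE grades (student TEXT, score INTEGER)")
    conn.executemany(
        "INSERT INTO grades VALUES (?, ?)",
        [("Ana", 95), ("Ben", 90), ("Cy", 72)],
    )
    return conn


GOLD = "SELECT student FROM grades WHERE score >= 90"
REWORDED = "SELECT student FROM grades WHERE NOT score < 90"  # same rows, different text
OFF_BY_ONE = "SELECT student FROM grades WHERE score > 90"    # misses the boundary row
```

On this toy table, REWORDED fails exact match but passes execution match, while OFF_BY_ONE fails both. It only fails execution match because the table happens to contain a score of exactly 90, which is why edge-case rows matter when you build test databases.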

Research Questions

  • How does adding a proof certificate change SQL correctness on Spider and BIRD?
  • What is the effect of provenance checks on catching hallucinated table or column references?
  • Does a correctness-vs-confidence metric predict wrong SQL better than raw LLM confidence?
  • To what extent does logical-plan verification reduce execution errors on nested joins and subqueries?
  • Which error types, such as schema mismatch, join mistake, or aggregation error, benefit most from certificate checking?
  • How does the system perform when the natural-language prompt is ambiguous versus specific?
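For the question about hallucinated table or column references, one low-cost check is to compile each generated query against an empty copy of the schema and read the database's own error message. The sketch below assumes SQLite and relies on its "no such table" and "no such column" messages. The function name and the label strings are hypothetical, and the labels would need adjusting for other engines.

```python
import sqlite3


def classify_reference_error(schema_sql, query):
    """Compile query against an empty schema and label any reference error."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # EXPLAIN makes SQLite compile the statement without running the query.
        conn.execute("EXPLAIN " + query)
    except sqlite3.OperationalError as exc:
        message = str(exc)
        if message.startswith("no such table"):
            return "hallucinated_table"
        if message.startswith("no such column"):
            return "hallucinated_column"
        return "other_error"  # e.g. a syntax error
    finally:
        conn.close()
    return "ok"


# Hypothetical one-table schema for illustration.
SCHEMA = "CREATE TABLE students (id INTEGER, name TEXT);"
```

A query such as SELECT gpa FROM students would be labeled hallucinated_column here, and SELECT name FROM pupils would be labeled hallucinated_table. This catches only references that do not exist at all. A query that uses a real but wrong column still compiles, so it needs the deeper checks your verifier provides.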

Basic Materials

  • Laptop with Python support and enough storage for dataset files.
  • Access to Spider benchmark data and BIRD benchmark data.
  • A text editor or notebook environment such as JupyterLab.
  • SQLite or DuckDB for local query execution.
  • Basic spreadsheet software for tracking results.
  • A small set of database schemas for sanity checks.
  • Access to an LLM API or local model for SQL generation.

Advanced Materials

  • Linux workstation or server with GPU access for model inference experiments.
  • PostgreSQL or another full SQL engine for deeper execution testing.
  • Query planner inspection tools or explain-plan output from the database engine.
  • Python libraries for parsing SQL and building logical-plan representations.
  • A provenance tracking library or custom event logging pipeline.
  • Dataset annotation tool for labeling error types.
  • Version-controlled experiment tracking system.

Software & Tools

  • Python: Runs parsing, evaluation, and metric scripts for query generation and verification.
  • JupyterLab: Lets you explore failures, inspect outputs, and compare runs side by side.
  • DuckDB: Executes SQL quickly on local copies of benchmark tables.
  • SQLite: Provides a simple engine for testing query syntax and logic on small schemas.

Experiment Steps

  1. Define the exact trust problem you want to test, such as syntax correctness, execution correctness, or user confidence.
  2. Choose one baseline text-to-SQL system and one verification design so you can compare them fairly.
  3. Build a scoring scheme that separates correct SQL, executable SQL, and provably supported SQL.
  4. Select benchmark queries and sort them by difficulty, schema complexity, and ambiguity.
  5. Design failure labels for common errors so you can analyze where proof certificates help most.
  6. Plan one confidence metric that combines model output with verifier output, then test whether it matches real correctness.
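Steps 3 and 6 can be sketched together. The example below is one possible design, not a prescribed one: it tracks executable, correct, and verified as separate rates, builds a combined confidence by discounting the model's confidence when the verifier rejects the certificate, and scores calibration with the Brier score (the mean squared gap between confidence and the 0/1 correctness outcome, where lower is better). The Outcome fields, the UNVERIFIED_PENALTY value, and the TOY records are all made up for illustration and are not experimental results.

```python
from dataclasses import dataclass

UNVERIFIED_PENALTY = 0.5  # hypothetical discount; choose it on a held-out split


@dataclass(frozen=True)
class Outcome:
    executable: bool   # the query ran without error
    correct: bool      # its result matched the gold query's result
    verified: bool     # the independent checker accepted the certificate
    model_conf: float  # generator's self-reported confidence in [0, 1]


def rates(outcomes):
    """Report the three properties separately instead of one blended score."""
    n = len(outcomes)
    return {
        "executable": sum(o.executable for o in outcomes) / n,
        "correct": sum(o.correct for o in outcomes) / n,
        "verified": sum(o.verified for o in outcomes) / n,
    }


def combined_conf(o):
    """Keep the model's confidence if verified, otherwise discount it."""
    return o.model_conf if o.verified else o.model_conf * UNVERIFIED_PENALTY


def brier(outcomes, conf_fn):
    """Mean squared error between confidence and actual correctness."""
    return sum((conf_fn(o) - float(o.correct)) ** 2 for o in outcomes) / len(outcomes)


# Made-up records, only to show how the functions are called.
TOY = [
    Outcome(executable=True, correct=True, verified=True, model_conf=0.9),
    Outcome(executable=True, correct=False, verified=False, model_conf=0.9),
    Outcome(executable=True, correct=True, verified=True, model_conf=0.6),
    Outcome(executable=False, correct=False, verified=False, model_conf=0.8),
]

raw_score = brier(TOY, lambda o: o.model_conf)
combined_score = brier(TOY, combined_conf)
```

Comparing raw_score with combined_score on your real benchmark runs tells you whether the verifier's verdict improves calibration. Keeping the three rates separate helps you avoid blurring correctness with confidence.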

Common Pitfalls

  • Using exact-match score alone, which penalizes queries that are written differently from the reference but return the same result.
  • Testing only easy schema questions, which hides failures on joins, nesting, and aggregation.
  • Letting the verifier see the same hidden reasoning as the generator, which makes the proof less independent.
  • Ignoring schema ambiguity, which can make a wrong query look correct on a small sample.
  • Mixing execution accuracy with trust calibration, which blurs whether the proof helps correctness or only confidence.

What Makes This Competitive

A class-level version of this project might only compare one model against another. A stronger version adds a real verifier, clear failure categories, and a metric that measures whether confidence matches truth. You can stand out by testing across multiple schema styles, not just one dataset slice. Careful ablation studies, where you remove one proof component at a time, can make your results much stronger.

Project Variations

  • Test the system on biomedical databases instead of benchmark text-to-SQL sets to see how schema complexity changes verification.
  • Compare two proof styles, logical-plan only versus logical-plan plus provenance, to see which better predicts correctness.
  • Focus on uncertainty calibration by comparing model confidence, verifier confidence, and final query success on hard versus easy questions.

Learn More

  • Spider dataset: Search for the official Spider benchmark paper and dataset page, which includes text-to-SQL schemas and evaluation details.
  • BIRD benchmark: Search the BIRD text-to-SQL benchmark paper and project page for harder, more realistic database tasks.
  • Foundations of Databases by Abiteboul, Hull, and Vianu: A classic database theory text, often available through school or library access.
  • MIT OpenCourseWare Database Systems: Free lecture notes and assignments for SQL, query processing, and relational algebra.
  • PubMed: Search for review articles on provenance in data management and trustworthy AI systems.
  • Proceedings of VLDB or SIGMOD: Search recent peer-reviewed papers on text-to-SQL, provenance, and query verification.
