AI Query Optimization for DuckDB

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Databases · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A bad query plan can turn a fast database into a snail. One wrong guess about join size can make the engine pick the wrong path and waste huge amounts of time. Your project can test how much small AI models help DuckDB avoid that mistake. You can even measure whether better estimates lead to better plans.

What Is It?

A database query optimizer picks the order and method for answering a SQL query. Think of it like a GPS for data. If the GPS thinks one road is empty when it is packed, you get stuck in traffic. If the optimizer thinks one join will be tiny when it is huge, the whole query can slow down.

Cardinality means how many rows a step will produce. A cardinality model tries to predict that number from table statistics, like column ranges, value counts, and correlations. The AI-native part means the model learns from past data instead of using only hand-written rules. The lower-bound guarantee part means the system also tries to avoid estimates that are too optimistic in a dangerous way, which can help prevent terrible plan choices.

Why This Is a Good Topic

This topic gives you a real systems problem with clear numbers to measure. You can compare estimated row counts, query runtimes, and plan choices across different models or database statistics. The project connects to a real need in data systems, since better optimization saves compute time and money. You can learn query planning, statistical modeling, and experimental design without needing a giant lab setup.

Research Questions

How does a small calibrated cardinality model change DuckDB's join order choices on benchmark queries?
What is the effect of per-database training data versus pooled training data on cardinality prediction error?
Does adding lower-bound constraints reduce catastrophic underestimates in join size prediction?
To what extent do better cardinality estimates improve total query runtime on skewed tables?
Which table statistics, such as distinct counts, histograms, or value correlations, most improve plan quality?
How does model size affect the tradeoff between prediction accuracy and optimizer speed?

Basic Materials

Laptop or desktop computer.
DuckDB installed locally.
SQL dataset with multiple related tables.
Python.
Pandas.
NumPy.
Jupyter Notebook.
Version control tool like Git.
Spreadsheet software for logging query results.

Advanced Materials

University workstation or lab server with enough memory for benchmark runs.
DuckDB source build with extension support.
JOB (Join Order Benchmark) dataset or a comparable join-heavy benchmark.
Python.
Pandas.
NumPy.
Scikit-learn.
PyTorch or JAX.
PostgreSQL or another reference database for comparison.
Profiling tools for query timing and plan inspection.
Statistics package for hypothesis tests.

Software & Tools

DuckDB: Runs local SQL queries quickly and lets you compare plans and runtimes.
Python: Cleans query logs, trains simple prediction models, and graphs results.
Pandas: Organizes benchmark outputs into tables for analysis.
Jupyter Notebook: Keeps your experiments, code, and notes in one place.
EXPLAIN ANALYZE: Shows the optimizer's chosen plan and the actual execution cost.

Experiment Steps

Define the exact query family you will test, then decide which tables, joins, and data shapes matter most.
Build a baseline that uses DuckDB's normal optimizer, so you have a fair point of comparison.
Choose the database statistics you will feed into your model, then separate training data from test queries.
Train a small estimator that predicts join cardinalities, then add a rule that keeps estimates from becoming unrealistically low.
Compare estimated row counts, chosen join orders, and runtime across baseline and model-backed runs.
Analyze where the model helps, where it hurts, and which statistics explain the biggest changes.

Common Pitfalls

Training and testing on the same query templates, which makes the model look better than it really is.
Ignoring skewed tables, which hides the hardest cases where join estimates fail.
Comparing runtimes without fixing cache state, which can blur the effect of the optimizer.
Measuring only prediction error and not plan quality, which misses the real database outcome.
Changing too many model features at once, which makes it impossible to tell what actually improved the optimizer.

What Makes This Competitive

A strong version of this project goes past simple accuracy scores. You would compare not just estimated row counts, but also downstream plan choices and runtime on unseen queries. The best projects test difficult cases, like skew, correlation, and cross-database transfer. A top entry also uses careful statistics, clear baselines, and a clean story about why the optimizer improves or fails.

Project Variations

Test whether the same calibrated model works on a different database engine, such as PostgreSQL, and compare plan changes.
Train separate models for skewed and non-skewed tables, then see whether specialization improves lower-bound accuracy.
Replace learned features with hand-built histogram features, then measure whether simpler statistics can compete.

Learn More

DuckDB documentation: Read the official docs for query plans, EXPLAIN, and extension development, then find it on the DuckDB website.
Join Order Benchmark paper: Search the paper in Google Scholar or arXiv to understand the benchmark used for join optimization.
PostgreSQL documentation: Read planner and statistics documentation on the official PostgreSQL site for a second database perspective.
MIT OpenCourseWare database systems materials: Find free lectures and notes on database query processing and optimization.
USENIX and VLDB papers: Search these conference proceedings for recent work on learned query optimization and cardinality estimation.

Systems Software Category Guide

How to Do Real Systems Software Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →