Hybrid Vector Search in SQLite

ISEF Category: Systems Software

Subcategory: Databases  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

Search engines feel instant because they do more than one thing at once. This project asks a hard question: can a database handle both similarity search and filters without slowing to a crawl? If you can make that work, you are exploring the same problem behind modern AI search and recommendation tools.

What Is It?

Hybrid search combines two ways of finding data. One is vector search, which ranks items by how similar their meaning is to the query. The other is relational search, which uses filters like author, date, or category. Think of it like looking for a book in a library by both topic and shelf label at the same time.

SQLite usually excels at structured queries, while vector indexes like HNSW help with nearest-neighbor search, which means finding the most similar items. The idea here is to connect them so the database can narrow results with filters and similarity together, instead of searching first and filtering later. That can save time and improve answer quality when the filter rules matter a lot.
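
The difference between "search first, filter later" and "filter and search together" is easier to see on toy data. The sketch below is only an illustration, not a hybrid index: it uses exact brute-force cosine similarity instead of HNSW, and the table name docs, the category values, and every function name are hypothetical. It shows that post-filtering can return fewer than k results when the filter is selective, while filtering first always fills the list if enough rows match.

```python
import math
import random
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_toy_db(n=200, dim=8, seed=0):
    """Hypothetical toy data: metadata lives in SQLite, vectors in a dict."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, category TEXT)")
    conn.execute("CREATE INDEX idx_docs_category ON docs(category)")
    vectors = {}
    for doc_id in range(n):
        category = "rare" if doc_id % 20 == 0 else "common"  # 5% are rare
        conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, category))
        vectors[doc_id] = [rng.gauss(0, 1) for _ in range(dim)]
    return conn, vectors


def matching_ids(conn, category):
    """Ids of rows that pass the relational filter."""
    rows = conn.execute("SELECT id FROM docs WHERE category = ?", (category,))
    return [row[0] for row in rows]


def post_filter(conn, vectors, query, category, k):
    """Two-step baseline: take the global top-k, then drop non-matching rows."""
    ranked = sorted(vectors, key=lambda i: cosine(vectors[i], query), reverse=True)
    allowed = set(matching_ids(conn, category))
    return [i for i in ranked[:k] if i in allowed]


def pre_filter(conn, vectors, query, category, k):
    """Filter first, then rank only the rows that passed the filter."""
    allowed = matching_ids(conn, category)
    ranked = sorted(allowed, key=lambda i: cosine(vectors[i], query), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    conn, vectors = build_toy_db()
    query = [1.0] * 8
    print("post-filter:", post_filter(conn, vectors, query, "rare", 5))
    print("pre-filter: ", pre_filter(conn, vectors, query, "rare", 5))
```

A real hybrid design would replace the brute-force ranking with an approximate index, but the two functions still describe the baseline and the target behavior you would compare.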

Your project would study whether a hybrid design really beats a two-step approach. You would measure speed, recall, and maybe memory use on a dataset such as Wikipedia abstracts with metadata. The core lesson is about tradeoffs, not just speed. Better indexing can change what answers a system can return, not only how fast it runs.
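
Recall is the metric that tells you whether a faster method is quietly returning worse answers. A minimal sketch of recall@k follows, assuming you already have an exact (brute-force) ranked list of filtered results to use as ground truth; the function name and argument names are illustrative.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k items that appear in the retrieved top-k.

    retrieved:    ids returned by the method under test, best first.
    ground_truth: ids from an exact search with the same filter, best first.
    """
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must contain at least one id")
    hits = sum(1 for item in retrieved[:k] if item in truth)
    # Divide by the number of true items, which may be below k
    # when fewer than k rows pass the filter.
    return hits / len(truth)
```

For example, if the exact search returns ids 1, 2, and 4 and the method under test returns 1, 2, and 3, recall@3 is two out of three.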

Why This Is a Good Topic

This is a strong science fair topic because you can test it with clear numbers. You can compare query speed, recall, and memory use across different index designs and filter settings. The real-world link is strong, since search tools, chatbots, and recommendation systems all need fast filtered retrieval. You can also learn database design, benchmarking, and evaluation, which are useful skills for computer science research.

Research Questions

  • How does a hybrid SQLite index change query latency compared with post-filtering after k-NN search?
  • What is the effect of filter selectivity on recall when you search by vector first and filter second?
  • Does jointly maintaining HNSW and B+-tree pointers reduce the number of candidate records scanned per query?
  • To what extent does hybrid indexing improve performance across different Wikipedia topic categories?
  • Which index update strategy keeps search speed stable as the dataset grows?
  • How does the hybrid approach affect memory use compared with separate vector and relational indexes?
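
Several of these questions depend on filter selectivity, so it helps to express it as a number: the fraction of rows that pass the filter. The sketch below is one way to measure it with Python's built-in sqlite3 module. The function name is hypothetical, and the table name and WHERE clause are pasted into the SQL text, so only use it with strings you wrote yourself.

```python
import sqlite3


def filter_selectivity(conn, table, where_clause, params=()):
    """Return matching_rows / total_rows for one filter.

    table and where_clause are trusted strings written by the experimenter;
    values inside the filter should still go through params.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        return 0.0
    matching = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_clause}", params
    ).fetchone()[0]
    return matching / total
```

Grouping your test queries by selectivity (for example, filters that keep about half the rows versus filters that keep about one row in a hundred) lets you plot recall and latency against it.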

Basic Materials

  • A laptop or desktop computer with enough storage for a local SQLite dataset.
  • SQLite with a vector extension or custom build that supports your chosen index design.
  • A public dataset of text items with metadata, such as Wikipedia abstracts and tags.
  • Python for data cleaning, query running, and logging results.
  • Jupyter Notebook or a similar notebook tool for analysis and plots.
  • A spreadsheet or CSV viewer for checking sample records and outputs.
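  • An automated timing library, such as Python's time.perf_counter, for repeated query tests; a handheld stopwatch is too coarse for queries that finish in milliseconds.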
  • Git for version control and experiment tracking.

Advanced Materials

  • A server or high-memory workstation for larger benchmark runs.
  • Access to SQLite source code for custom index experimentation.
  • A Linux environment for reproducible builds and performance testing.
  • A profiler such as perf or cProfile for bottleneck analysis.
  • A benchmark harness for repeated randomized query evaluation.
  • A vector search library or implementation of HNSW for comparison testing.
  • A separate database engine for baseline comparisons, if available.
  • A larger text corpus with richer metadata fields for stress testing.

Software & Tools

  • Python: Runs benchmarks, cleans datasets, and calculates recall and latency metrics.
  • SQLite: Stores structured fields and supports the database queries you test.
  • Jupyter Notebook: Helps you inspect results, make plots, and compare index designs.
  • Git: Tracks code changes and lets you roll back failed index experiments.

Experiment Steps

  1. Define one query pattern that mixes similarity search with metadata filters, then decide what counts as success.
  2. Choose a baseline design, such as vector search followed by a filter, so you have something to beat.
  3. Plan the index variant you will test, including how vector links and B+-tree lookup paths will work together.
  4. Build an evaluation set that includes different filter strengths, dataset sizes, and topic groups.
  5. Decide which metrics matter most, such as latency, recall, memory use, and update cost.
  6. Set up repeated runs with randomized query order so that caching and run-order effects are not mistaken for real differences between designs.
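
The repeated, randomized runs in the last step can be sketched as a small timing harness. This is a minimal example under stated assumptions: search_fn stands for whichever query method you are testing, the function and key names are hypothetical, and the 95th percentile uses the simple nearest-rank rule.

```python
import math
import random
import statistics
import time


def benchmark(search_fn, queries, repeats=5, seed=0):
    """Time search_fn on every query, reshuffling the order on each repeat."""
    rng = random.Random(seed)  # fixed seed so the run order is reproducible
    latencies = []
    for _ in range(repeats):
        order = list(queries)
        rng.shuffle(order)
        for query in order:
            start = time.perf_counter()
            search_fn(query)
            latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank 95th percentile: the smallest value with 95% of runs at or below it.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "runs": len(latencies),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }
```

Reporting the median and a high percentile together is usually more informative than a single average, because a few slow queries can hide behind a good mean.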

Common Pitfalls

  • Using a query set where the filters are too broad, which hides any benefit from the hybrid index.
  • Comparing results with different recall thresholds, which makes one approach look faster only because it returns fewer true matches.
  • Forgetting to separate indexing time from query time, which blurs build cost with search cost.
  • Testing on too small a dataset, where even a brute-force scan is fast and any difference between index designs disappears into timing noise.
  • Ignoring update behavior, which leaves you with a fast read index that falls apart when records change.
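
One way to avoid blurring build cost with search cost is to time each phase on its own and to confirm that the query really uses the index you built. The sketch below does this for an ordinary SQLite index with the built-in sqlite3 module; the table, index, and helper names are hypothetical, and a vector index would need its own build and query calls in place of these.

```python
import sqlite3
import time


def timed(fn):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO docs (category) VALUES (?)",
    [("rare" if i % 50 == 0 else "common",) for i in range(10_000)],
)

# Phase 1: index build, timed separately from any query.
_, build_seconds = timed(
    lambda: conn.execute("CREATE INDEX idx_docs_category ON docs(category)")
)

# Phase 2: query time, measured only after the index exists.
sql = "SELECT id FROM docs WHERE category = ?"
rows, query_seconds = timed(lambda: conn.execute(sql, ("rare",)).fetchall())

# Ask SQLite how it plans to run the query; the last column is a text description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, ("rare",))]

print(f"build: {build_seconds:.6f} s, query: {query_seconds:.6f} s")
print("plan:", plan)
```

Logging both numbers for every configuration keeps an expensive-to-build index from looking free just because its queries are fast.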

What Makes This Competitive

A class-level version of this project just shows that one query method is faster. A stronger version explains why, with careful controls and several benchmark conditions. You can raise the level by testing different filter selectivities, dataset sizes, and update patterns, then analyzing both recall and latency. A novel comparison or a cleaner index design can make the work feel like real systems research, not just a speed test.

Project Variations

  • Test the same hybrid index on news headlines with topic and date filters instead of Wikipedia abstracts.
  • Compare HNSW plus B+-tree pointers against a pure inverted index plus metadata filter pipeline.
  • Measure whether the hybrid design helps more on rare-category queries than on common-category queries.

Learn More

  • SQLite Documentation: Read the official query planner and indexing docs on the SQLite website.
  • PubMed: Search for review articles on vector retrieval, information retrieval, and database search evaluation if you want a broader methods background.
  • arXiv: Search for recent papers on hybrid search, approximate nearest neighbors, and filtered vector retrieval.
  • MIT OpenCourseWare, Introduction to Algorithms: Review search, trees, and graph basics through the open course materials.
  • NIST Engineering Statistics Handbook: Use the free guidance on benchmarking, uncertainty, and comparing performance measurements.