Cache-Oblivious Vector Search on a Laptop

Cache-Oblivious Vector Search on a Laptop

ISEF Category: Systems Software

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Algorithms  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A search engine can be fast and still waste a huge amount of memory. That matters when your data no longer fits in RAM. Your project asks a sharp question, can smart layout beat a famous graph index without needing a giant server? You will test that tradeoff on real embedding datasets.

What Is It?

Approximate nearest-neighbor search finds the closest items to a query without checking every item. Think of it like finding the nearest bookstores in a city without walking every street. Instead of perfect answers, you accept a tiny amount of error in exchange for speed and lower memory use.

This phenomenon combines two ideas. Cache-oblivious design means the data layout tries to work well with the computer's memory system without tuning for one exact cache size. Hilbert-curve locality means nearby points in high-dimensional space get placed near each other in memory more often, which can reduce slow memory jumps. Learned quantization means a model compresses vectors into smaller codes, so you store less while still keeping useful structure for search.

Why This Is a Good Topic

This makes a strong science fair topic because you can measure clear outputs, recall, query speed, and RAM use. You can compare your index against a known baseline and see real tradeoffs instead of only showing code that runs. The project connects to search, recommendation systems, and large-scale AI systems, where memory limits matter every day. You can also learn algorithm design, benchmarking, and statistical analysis.

Research Questions

  • How does Hilbert-curve ordering affect recall at fixed RAM use compared with a graph-based baseline? ?
  • What is the effect of learned quantization level on query speed and recall? ?
  • Does cache-oblivious layout reduce memory traffic enough to improve throughput on a single laptop? ?
  • To what extent does dataset type, such as image embeddings versus vector search benchmarks, change the best index settings? ?
  • Which combination of codebook size and neighborhood width gives the best recall-vs-RAM curve? ?
  • How does batch query size affect latency and memory behavior for the same index? ?

Basic Materials

  • Laptop with at least 16 GB RAM, preferably 32 GB or more.
  • Python 3.11 or later.
  • External SSD or large local storage for public embedding files.
  • Git for version tracking.
  • A spreadsheet or notebook for logging benchmark results.
  • Public embedding datasets such as SIFT1B subsets or LAION-derived embeddings.
  • Enough disk space to store indices, caches, and evaluation outputs.

Advanced Materials

  • Workstation or server access with 64 GB RAM or more.
  • Linux machine for repeatable benchmarking.
  • C++ compiler with optimization support.
  • Python environment for orchestration and analysis.
  • Profiling tools such as perf or Valgrind Cachegrind.
  • ANN benchmark libraries or reference implementations for comparison.
  • Large public embedding corpora, such as SIFT1B and LAION-400M embeddings or approved subsets.

Software & Tools

  • Python: Runs benchmarking scripts, data prep, and result analysis.
  • NumPy: Handles vector math and array operations for experiments.
  • Faiss: Provides baseline approximate nearest-neighbor methods for comparison.
  • pandas: Organizes recall, latency, and memory measurements into tables.
  • Matplotlib: Plots recall-vs-RAM and speed-vs-accuracy curves.

Experiment Steps

  1. Define the exact search task you will optimize, including vector type, query set, and what counts as a correct near neighbor.
  2. Choose one baseline index and one memory-aware design so you can compare both under the same evaluation rules.
  3. Decide how you will measure quality, speed, and memory, then keep those metrics fixed across all tests.
  4. Build a benchmark plan that separates preprocessing time, index size, and per-query performance.
  5. Test a small set of hyperparameters first, then narrow to the most promising settings for a deeper comparison.
  6. Analyze tradeoffs with plots that show where your method wins and where it loses.

Common Pitfalls

  • Comparing methods on different hardware settings, which makes speed and memory numbers unfair.
  • Measuring only recall and ignoring RAM use, which misses the whole point of the index design.
  • Using a tiny sample of vectors, which hides cache behavior and makes the results unstable.
  • Letting preprocessing time get mixed into query latency, which makes the benchmark misleading.
  • Changing more than one design choice at once, which makes it hard to tell whether Hilbert ordering or quantization caused the effect.

What Makes This Competitive

A stronger version of this project would compare multiple index layouts under the same benchmark harness and report full tradeoff curves, not just one best point. You could test whether the memory savings still hold across different embedding families, query batches, and hardware limits. Strong entries also explain why the gains happen, using profiling or cache-miss data instead of only reporting recall. A novel analysis angle, like showing where locality breaks down, can make the work much more interesting.

Project Variations

  • Test the same index on image embeddings versus text embeddings to see whether data structure benefits depend on geometry.
  • Replace one quantization method with a simpler compression scheme and compare recall-vs-RAM curves.
  • Analyze batch query behavior to see whether the cache-aware layout helps more for single queries or grouped searches.

Learn More

  • FAISS documentation: Read the open-source library docs for practical nearest-neighbor search methods and baseline ideas, then find the project on GitHub.
  • Algorithms for Approximate Nearest Neighbor Search by open course notes: Search MIT OpenCourseWare or similar university notes for vector search and indexing lectures.
  • PubMed and Google Scholar: Search for review papers on approximate nearest-neighbor search, vector quantization, and cache-aware indexing.
  • arXiv: Search for recent preprints on ANN search, locality-sensitive layouts, Hilbert curves, and product quantization.
  • Papers With Code: Find benchmarked ANN methods and public datasets, then compare reported metrics with your own results.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub​ →

Shopping Cart