Memory-Saving Genome K-mer Counting Project

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Algorithms · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A human genome has billions of DNA letters, but software still has to count tiny patterns in it fast. That is like sorting every word in a library by reading one letter at a time. If your data structure wastes memory, the whole analysis can crash before it finishes. Your project asks how to keep the answers while shrinking the storage.

What Is It?

K-mer counting means counting every short DNA string of length k in a genome. If k equals 31, each chunk is 31 letters long. Researchers use these counts for genome assembly, error checking, and species comparison. The hard part is scale. A full human reference has so many k-mers that a simple dictionary can eat huge amounts of RAM.

This project uses two ideas to shrink memory use. Minimizer-based partitioning groups similar k-mers into smaller buckets, so the program does not have to hold everything at once. Elias-Fano encoding is a compact way to store sorted numbers. Think of it like compressing a phone contact list by storing the pattern in the numbers instead of writing every digit in full. Together, these methods aim to keep the count structure small enough to fit in memory while still giving correct counts.

Why This Is a Good Topic

This topic is strong because you can measure real performance, not just build something that sounds clever. You can test memory use, runtime, accuracy, and scalability on public genome data. It connects to medicine, evolution, and bioinformatics pipelines, so the problem feels real. You can also learn algorithm design, data compression, benchmarking, and experimental comparison, which are all useful for competitive computing projects.

Research Questions

How does minimizer length affect memory use in genome-scale k-mer counting?
What is the effect of k-mer length on the speed of a succinct counting structure?
Does Elias-Fano encoding reduce memory more than a plain hash table for sorted k-mer counts?
To what extent does partition size change runtime when counting k-mers from large reference genomes?
Which input genome size causes the first major slowdown in a compact counting pipeline?
How does the compact method compare with KMC3 and Jellyfish on memory and throughput?

Basic Materials

Computer with at least 16 GB RAM, preferably more.
Access to a Linux environment or a Linux virtual machine.
Public genome FASTA files from NCBI or Ensembl.
Sample k-mer counting software such as KMC3 and Jellyfish.
Python for scripting runs and parsing output.
Spreadsheet software for logging runtime, memory, and accuracy.
Git for version control.
Basic command-line tools such as grep, awk, and time.

Advanced Materials

University workstation or server with access to large RAM allocations.
Linux cluster account for large-scale benchmarking.
Genome references and read sets from NCBI, ENA, or UCSC.
Reference implementations or papers describing Elias-Fano encoding.
Performance profiling tools such as perf and /usr/bin/time.
Python libraries for analysis, such as pandas, numpy, and scipy.
Optional C++ compiler toolchain for building and modifying algorithms.
Container tools such as Docker or Singularity if the lab allows them.

Software & Tools

Python: Organizes benchmark data, compares results, and makes plots.
R: Runs statistical tests and creates clean figures for your final report.
ImageJ: Not needed for this topic, so skip it unless you create visual summary graphics.
Jupyter Notebook: Keeps code, notes, and output together while you test ideas.
GitHub Desktop: Tracks code changes and helps you roll back bad edits.

Experiment Steps

Define the exact counting problem you will solve, including genome size, k-mer length, and output format.
Choose one compact encoding idea to test first, then decide what baseline method you will compare against.
Design your benchmark set so you can measure both memory footprint and runtime on data with different sizes.
Plan the controls that separate algorithm effects from hardware effects, such as using the same machine and input format.
Build a scoring plan for accuracy, compression ratio, throughput, and failure cases.
Decide how you will present the results, including tables, plots, and a short explanation of tradeoffs.

Common Pitfalls

Testing on only one genome, which makes the results look stronger than they really are.
Using different file formats for different methods, which mixes algorithm speed with parsing overhead.
Measuring peak RAM with inconsistent tools, which makes memory numbers impossible to compare.
Changing k and the minimizer settings at the same time, which hides the cause of any improvement.
Ignoring output verification, which can let a fast method look good even if it counts k-mers incorrectly.

What Makes This Competitive

A strong project goes past a simple speed comparison. You would test multiple genome sizes, report memory and runtime together, and explain why one method wins under certain settings. You would also compare accuracy and failure modes, not just average performance. If you add a careful analysis of partitioning or compression behavior, the project starts to look like real algorithm research.

Project Variations

Test the same compact counting idea on bacterial genomes, plant genomes, and the human reference to see how genome complexity changes performance.
Swap exact k-mer counting for canonical k-mers and measure whether that reduces memory without hurting biological usefulness.
Compare minimizer-based partitioning with another partitioning rule, then analyze which one gives the best balance of speed and compression.

Learn More

NCBI Genome Database: Find reference genomes and sequence files for benchmarking.
NIH PubMed: Search for review articles on k-mer counting, genome assembly, and succinct data structures.
MIT OpenCourseWare: Look for algorithms and data structures courses that cover hashing, compression, and performance analysis.
Google Scholar: Search for the original papers on KMC3, Jellyfish, minimizers, and Elias-Fano encoding.
US National Center for Biotechnology Information: Use the FTP and assembly resources to download public genomic datasets.

Systems Software Category Guide

How to Do Real Systems Software Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →