LLM Bioinformatics Pipeline Generation

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A single bad pipeline can turn good RNA-seq data into junk. That makes workflow design a real scientific problem, not just a coding task. Your project can test whether an LLM agent writes pipelines that work, fail, or need human repair.

What Is It?

This project asks whether an LLM agent can build a bioinformatics pipeline that processes RNA-seq data correctly. RNA-seq measures which genes are active in a sample, and how strongly, by sequencing its RNA. A pipeline is the chain of software steps that turns raw sequencing files into results you can trust.

Think of it like a kitchen assembly line. Each tool has one job, such as trimming reads, aligning them to a genome, or counting gene expression. If one station makes a mistake, the final dish looks fine but tastes wrong. Your job is to see how close the AI gets to a known-good recipe.

The critique part matters too. A strong agent should not just write code. It should spot missing inputs, bad file paths, weak version control, and steps that break reproducibility. That gives you a way to study both generation and self-checking.
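One piece of that self-checking can be sketched as a plain pre-flight check. This is a minimal sketch under stated assumptions, not the way any particular agent works: `find_missing_inputs` is a hypothetical helper, and it assumes you have already pulled the declared input paths out of the generated pipeline into a list.

```python
from pathlib import Path


def find_missing_inputs(declared_inputs, base_dir="."):
    """Return the declared input paths that do not exist under base_dir.

    declared_inputs: path strings taken from a generated pipeline.
    How you extract them is up to you; this sketch assumes a plain list.
    """
    base = Path(base_dir)
    return [p for p in declared_inputs if not (base / p).exists()]
```

You could run a check like this before and after the critique step and record whether the list of missing inputs gets shorter.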

Why This Is a Good Topic

This is a strong science fair topic because you can test clear outputs, like whether a pipeline runs, whether it matches a gold standard, and how often it needs human fixes. You also connect to a real problem in biology, since reproducible workflows affect drug discovery, gene studies, and public datasets. You can learn coding, workflow design, error analysis, and scientific reproducibility without inventing a brand-new wet-lab method.

Research Questions

How does prompt structure affect whether an LLM agent produces a runnable RNA-seq workflow?
What is the effect of adding nf-core examples on pipeline correctness and reproducibility?
Does a critique step reduce missing dependencies or broken file paths in generated workflows?
To what extent do generated Snakemake and Nextflow pipelines match curated gold standards?
Which error types appear most often in LLM-generated bioinformatics pipelines?
How does dataset complexity affect the number of human edits needed before a pipeline runs?

Basic Materials

A laptop or desktop computer with enough storage for workflow files and logs.
Internet access for downloading public RNA-seq test datasets and documentation.
Python installed with a code editor such as VS Code or PyCharm Community Edition.
Snakemake and Nextflow installed locally or in a containerized environment.
Git for version tracking and change history.
Access to public RNA-seq datasets from NCBI GEO or the Sequence Read Archive.
Curated nf-core pipeline documentation for comparison.
Spreadsheet software for tracking errors, edits, and run status.

Advanced Materials

A workstation or lab server with Linux and multiple CPU cores.
Conda or Mamba for controlled software environments.
Docker or Singularity (Apptainer) for reproducible workflow execution.
A local or cloud-hosted large language model setup, if available through a university lab.
Bash scripting tools for automated pipeline checks.
RNA-seq benchmark datasets with known expected outputs.
Reference genomes and annotation files matched to the test datasets.
Job scheduler access such as SLURM for repeatable pipeline runs.

Software & Tools

Python: Parses logs, tracks errors, and compares generated workflows against gold standards.
Snakemake: Runs rule-based workflows that you can test for correctness and reproducibility.
Nextflow: Runs dataflow-style pipelines and gives you a second common bioinformatics framework to compare against Snakemake.
GitHub: Stores versions of prompts, generated pipelines, and edit history.
RStudio: Summarizes failure rates, edit counts, and benchmark results with plots and tables.
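As an illustration of the log-parsing role listed for Python above, here is a minimal sketch that tallies error types from a run log. The category names and the `classify_log` helper are hypothetical. The patterns are common shell and Python error messages, not an official taxonomy, so you would likely extend them after reading your own logs.

```python
import re
from collections import Counter

# Hypothetical error categories. The patterns are generic shell/Python
# messages, chosen for illustration; add workflow-specific ones as needed.
ERROR_PATTERNS = {
    "missing_tool": re.compile(r"command not found"),
    "missing_file": re.compile(r"No such file or directory"),
    "missing_python_module": re.compile(r"ModuleNotFoundError"),
}


def classify_log(log_text):
    """Count how many log lines match each error category."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Writing one row of counts per run into your tracking spreadsheet keeps the error tallies comparable across prompt versions.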

Experiment Steps

Define one RNA-seq task, such as alignment or differential expression, so your comparison stays focused.
Choose the benchmark workflow you will treat as the gold standard and list the exact features you will score.
Design prompt versions that differ in one factor, such as examples, critique, or tool instructions.
Set up a scoring plan that checks whether the generated pipeline runs, matches expected outputs, and stays reproducible across repeated runs.
Plan how you will record human intervention, including code edits, missing dependencies, and fixes to file handling.
Decide how you will compare workflow families, such as Snakemake versus Nextflow, using the same dataset and scoring rules.
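Two of the measurements in the steps above, human intervention and reproducibility across repeated runs, can be sketched in a few lines. Both helpers are hypothetical and rest on simplifying assumptions: repair effort is approximated by counting changed lines, and reproducibility is defined as byte-identical outputs.

```python
import difflib
import hashlib


def count_edited_lines(generated, repaired):
    """Count lines added or removed between the generated pipeline text
    and the human-repaired version (a rough proxy for repair effort)."""
    diff = difflib.ndiff(generated.splitlines(), repaired.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def outputs_reproducible(outputs_per_run):
    """Return True if every repeated run produced byte-identical outputs.

    outputs_per_run: one dict per run, mapping output name -> bytes.
    """
    def digest(outputs):
        return {name: hashlib.sha256(data).hexdigest()
                for name, data in outputs.items()}

    digests = [digest(run) for run in outputs_per_run]
    return all(d == digests[0] for d in digests)
```

A modified line counts twice here (one removal, one addition), so decide up front whether that fits your scoring rules. Some tools write timestamps or command lines into their outputs, so a byte-identical check may be too strict; comparing the final count tables may be a fairer target.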

Common Pitfalls

Comparing pipelines that solve different RNA-seq tasks, which makes your benchmark unfair.
Using datasets with missing metadata or bad file structure, which causes failures that are not the model's fault.
Scoring only whether code looks correct, which misses runtime errors and hidden reproducibility problems.
Changing too many prompt variables at once, which makes it hard to tell what caused each improvement or failure.
Forgetting to standardize software versions, which can make the same workflow behave differently across runs.
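The last pitfall can be checked automatically. The sketch below flags dependencies that lack an exact version. It is a simplified illustration, and `unpinned_dependencies` is a hypothetical helper: it treats `name=1.2.3` or `name==1.2.3` as pinned and everything else as unpinned, which does not cover the full Conda dependency syntax (for example, entries with build strings or channel prefixes).

```python
import re

# Assumption: "name=version" or "name==version" counts as pinned.
# Ranges ("name>=1.2"), wildcards ("name=1.2.*"), and bare names do not.
PINNED = re.compile(r"^[\w.\-]+==?\d[\w.\-]*$")


def unpinned_dependencies(deps):
    """Return the dependency strings that lack an exact version pin."""
    return [d for d in deps if not PINNED.match(d.strip())]
```

For example, passing the dependency list from a generated environment file returns only the entries you would want to fix before comparing runs.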

What Makes This Competitive

A competitive version of this project would measure more than pass or fail. You could score exact error types, recovery after critique, and how much human repair each workflow needs. You could also compare two workflow systems, then test whether the agent handles one better than the other. Strong logging, clean benchmarks, and careful statistics would make the project feel much more like real methods research.
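As one example of what careful statistics could look like, the sketch below puts a confidence interval around a pass rate. The guide does not name a specific method; the Wilson score interval is used here as an assumption because it behaves reasonably when the number of runs is small.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half
```

If 5 of 10 generated pipelines run, `wilson_interval(5, 10)` returns roughly (0.24, 0.76), which shows how wide the uncertainty is with only ten runs per condition.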

Project Variations

Test the agent on a different omics task, such as variant calling instead of RNA-seq.
Compare pipelines generated for Snakemake, Nextflow, and plain Bash scripts to see which format the agent handles best.
Add a second analysis layer that checks whether the agent's workflow choices preserve reproducibility across different compute environments.

Learn More

nf-core documentation: Read the community standards for curated bioinformatics workflows on the nf-core website.
Snakemake documentation: Learn workflow syntax, rule design, and execution patterns from the official Snakemake docs.
Nextflow documentation: Study pipeline structure and reproducibility features in the official Nextflow docs.
NCBI GEO: Search public RNA-seq datasets and sample metadata through the Gene Expression Omnibus database.
PubMed: Search review articles on RNA-seq workflow reproducibility, benchmarking, and automated pipeline design.
MIT OpenCourseWare: Find free computer science and data analysis materials that help with scripting, version control, and experimental design.

