How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Computational biology used to live inside university clusters with budgets most schools never see. Today, the same models, the same genomes, and the same simulation engines run on a free Google Colab tab from your bedroom.

This guide is your starting point. It covers the three things you need to begin: the small home kit that turns your laptop into a science workstation, the free software professional labs actually use, and the public datasets that count as real research data.

Why This Is Possible Now

Three shifts in the last decade changed what a high school student can do.

First, public biological data exploded. Most major genomes, protein structures, single-cell atlases, drug bioactivity tables, and cohort-level summary statistics are now downloadable for free. You can pull the same open-access TCGA tumor data, the same UK Biobank summary stats, and the same AlphaFold structures that a graduate student pulls. (Individual-level patient data usually still requires an approved application.)

Second, free GPU compute became normal. Google Colab and Kaggle give you a real NVIDIA GPU at no cost, usually for a few hours at a time, although free-tier availability and session limits vary. That is enough to fold proteins with ColabFold, dock ligands with DiffDock, fine-tune a small transformer, or run a short molecular dynamics trajectory in OpenMM.

Third, the heavy software went open source. AlphaFold, ESM-2, RDKit, GROMACS, OpenMM, PyTorch, scikit-learn, NEURON, Brian2, Mesa, NetLogo, COBRApy, and DeepChem are all free to install. Twenty years ago, comparable tools often cost a license fee or required a lab affiliation, and several of these did not exist yet.

Put together: a laptop on a kitchen counter plus a Colab tab now equals a working computational biology workstation.

The Computational Biology Home Kit

Most of your "kit" is software, but a small physical setup makes your project measurable and presentable.

Workstation basics

A laptop, ideally with at least 8 GB RAM. For Colab-first projects, any modern Chromebook, Mac, or Windows machine works, because the heavy computation runs in the cloud.
A free Google account for Colab and Drive storage.
A second monitor or tablet for reading papers while you code (optional but useful).

Optional wet-lab and citizen-science add-ons

Clip-on smartphone microscope lens (~$10) for pond-water or microorganism imaging.
Smartphone tripod and a windowsill timer setup for time-lapse photography of cultures.
Kitchen-safe culture materials: kombucha SCOBY, baker's yeast, sprouted seeds, garden soil.
Kitchen scale, balloons, and a metric tape measure for fermentation rate or CO₂ displacement measurements (~$15 total).
Citizen-science accounts on iNaturalist, eBird, FoldIt, and Eterna (free).

Optional consumer biosensors

A consumer EEG headset such as Muse or OpenBCI Ganglion (under $200) for attention, sleep, or focus studies.
A smartphone heart-rate or pulse-ox app paired with a notebook for physiology pilot data.
A USB microphone for bioacoustic recording (mosquito wingbeats, bird calls, voice samples).

Lab notebook stack

A bound notebook or a Notion/Obsidian vault for daily entries.
A free GitHub account for version-controlling code and saving notebooks.

Total cost for everything above, if you buy the optional pieces: roughly $0 to $250, and most strong projects use less than $50 of physical hardware.

The Signature Technique: Running Professional Pipelines on a Free Colab GPU

Computational biology's signature move is taking a tool that used to need a server room and running it from a browser tab. Once you can do this, the rest of the field opens up. Here is the five-step workflow.

Open a fresh Colab notebook at colab.research.google.com and switch the runtime to GPU under Runtime > Change runtime type.
Mount Google Drive so your inputs, model weights, and outputs persist between sessions. One cell, two lines.
Install the tool with pip or conda inside the notebook. ColabFold, OpenMM, RDKit, DeepChem, AutoDock Vina, and Hugging Face transformers all install in a few minutes.
Run on a small test input first. Fold one short protein. Dock one ligand. Simulate one short trajectory. Confirm the output makes sense before scaling up.
Scale to your real dataset and save outputs to Drive. Log runtimes, parameters, and random seeds in your notebook so the run is reproducible.

This loop is how you fold a protein, run a molecular dynamics simulation, train a graph neural network, or build an epidemiological forecasting pipeline. The same five steps cover every dry-lab tool below.
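The logging habit in the last step can be sketched in a few lines of plain Python. This is a minimal illustration, not a standard tool: `log_run`, `toy_work`, and the file name `run_log.json` are hypothetical names chosen for this example, and the "work" is a stand-in for a real job such as folding one short protein.

```python
import json
import random
import tempfile
import time
from pathlib import Path


def log_run(out_dir, params, seed, work):
    """Run work(params) with a fixed seed and save a JSON record of the run."""
    random.seed(seed)  # fixing the seed makes the random draws repeatable
    start = time.perf_counter()
    result = work(params)
    record = {
        "params": params,
        "seed": seed,
        "runtime_seconds": round(time.perf_counter() - start, 4),
        "result": result,
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "run_log.json").write_text(json.dumps(record, indent=2))
    return record


def toy_work(params):
    """Stand-in for a real job (folding one protein, docking one ligand)."""
    return sum(random.random() for _ in range(params["n_samples"]))


# On Colab, after mounting Drive at /content/drive, a folder under
# /content/drive/MyDrive would go here so the log survives the session.
# A temporary folder keeps this sketch runnable anywhere.
out_dir = Path(tempfile.mkdtemp()) / "test_run"

first = log_run(out_dir, {"n_samples": 1000}, seed=42, work=toy_work)
second = log_run(out_dir, {"n_samples": 1000}, seed=42, work=toy_work)
print(first["result"] == second["result"])  # same seed, so the results match
```

Real tools such as PyTorch or OpenMM have their own seed settings, so a full project would record and set those as well.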

The Dry-Lab Side: Free Software You Can Install Today

Structure and structural biology

PyMOL and ChimeraX for viewing and annotating protein structures.
ColabFold for AlphaFold2 protein structure prediction in a browser.
ESMFold (built on the ESM-2 protein language model) for fast single-sequence structure prediction.
Boltz-1 for newer open-source structure prediction including complexes.

Docking and drug design

AutoDock Vina and Smina for classical ligand docking.
DiffDock for diffusion-model-based pose prediction.
RDKit for cheminformatics, molecular descriptors, and SMILES handling.
DeepChem for chemistry-focused machine learning pipelines.
REINVENT and fragment-based generators for de novo molecular design.

Molecular dynamics and biophysics

OpenMM for GPU-accelerated molecular dynamics on Colab.
GROMACS for full-feature classical simulations.
Martini, a coarse-grained force field developed for GROMACS and also usable in OpenMM, for membrane and nanoparticle simulations.

Systems biology and simulation

COBRApy for genome-scale metabolic flux balance analysis.
Tellurium and COPASI for ODE-based pathway modeling.
PhysiCell for multicellular tissue-scale simulations.
Mesa (Python) and NetLogo for agent-based models.
Gillespie SSA in Python for stochastic biochemical simulations.
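To make the last item concrete, here is a minimal sketch of the Gillespie stochastic simulation algorithm using only the Python standard library. The system is an assumed toy example: one molecule type produced at a constant rate and degraded in proportion to its count. The function name and rate values are illustrative, not taken from any particular paper.

```python
import random


def gillespie_birth_death(k_make, k_decay, n0, t_end, seed=0):
    """Simulate production (rate k_make) and decay (rate k_decay * n)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        if total == 0:
            break  # nothing can happen any more
        # Waiting time to the next reaction is exponential in the total rate.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fired, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(
    k_make=10.0, k_decay=0.1, n0=0, t_end=200.0, seed=1
)
print(len(times), counts[-1])
```

For this toy system the count should fluctuate around k_make / k_decay (100 with the values above) once the simulation settles, which is a useful sanity check before moving to a real reaction network.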

Neuroscience modeling

NEURON and Brian2 for biophysical and spiking neural network models.

Machine learning core stack

scikit-learn for classical ML and statistics.
PyTorch and JAX for deep learning.
PyTorch Geometric for graph neural networks on molecules and biological networks.
Hugging Face for transformers, foundation models, and pretrained checkpoints.

Bioinformatics pipelines

Snakemake and Nextflow (with nf-core templates) for reproducible workflows.
Biopython for sequence data wrangling and scanpy for single-cell data wrangling.

Running these tools yourself, not just reading about them, changes how research feels. You stop being only a student of biology and start practicing it.

Public Databases That Count as Real Data

Sequence and gene annotation

NCBI for GenBank, RefSeq, SRA, and almost every public sequence ever deposited.
Ensembl for genome browsers and comparative genomics.
UniProt for protein sequences and functional annotation.
Pfam (now hosted within InterPro) and InterPro for protein domains and families.
KEGG for pathways and metabolic networks.

Expression and single-cell atlases

GTEx for human tissue gene expression.
GEO and Expression Atlas for thousands of public expression studies.
Human Cell Atlas and Tabula Sapiens for single-cell reference maps.
ENCODE and Roadmap Epigenomics for regulatory element tracks.

Variation and population genetics

1000 Genomes for global human variation.
gnomAD for allele frequencies across populations.
UK Biobank summary statistics for GWAS-scale phenotype-genotype links.

Cancer and disease genomics

TCGA (via cBioPortal and GDC) for multi-omic tumor data.

Structures and drug-target data

PDB for experimentally determined protein structures.
AlphaFold DB for predicted structures of nearly every known protein.
ChEMBL, DrugBank, PubChem, and BindingDB for compound bioactivity.
ZINC for purchasable molecular libraries.

Clinical and pharmacology

FAERS for adverse-event reports.
Stanford HIVdb for resistance mutations.
MIMIC-IV for de-identified ICU data (with a free credentialing step).

Epidemiology and public health

WHO, CDC WONDER, and CDC NWSS for disease and wastewater surveillance.
Our World in Data (OWID) COVID-19 dataset and the Johns Hopkins CSSE archives for historical case and death counts.
HealthData.gov for U.S. open health datasets.
Google Trends and Google Community Mobility archives for behavioral signals.

Environmental and ecological

ERA5 climate reanalysis, NASA land-surface temperature, NLCD land cover.
iNaturalist and eBird for citizen-science species occurrence.
OpenStreetMap for geographic context.

Re-analyzing a public dataset with a new method is itself a legitimate research path. Some of the strongest ISEF computational projects never generate a single new data point.

How to Combine Wet and Dry: The Strongest Project Shape

Pattern A: Household measurement calibrates a simulation. Grow something simple at home (a SCOBY, sprouted seeds, yeast fermenting in a balloon, pond-water microbes on a smartphone microscope). Measure it over time. Then fit a model (agent-based, ODE, flux balance, or reaction-diffusion) to your measurements and use the model to predict conditions you did not test.
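As a rough sketch of Pattern A, the example below fits the simplest possible growth model, exponential growth, to a short time series and then predicts a time point that was not measured. The numbers are synthetic stand-ins generated inside the code, not real measurements, and `fit_exponential` is a hypothetical helper written for this illustration.

```python
import math


def fit_exponential(times, values):
    """Least-squares fit of values = y0 * exp(rate * t) on log-transformed data."""
    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    rate = sxy / sxx
    y0 = math.exp(mean_l - rate * mean_t)
    return y0, rate


# Synthetic stand-in for home measurements (for example, balloon volume
# from fermenting yeast). Replace with your own recorded values.
true_y0, true_rate = 2.0, 0.30
hours = [0, 1, 2, 3, 4, 5, 6]
wobble = [1.03, 0.98, 1.01, 0.97, 1.02, 1.00, 0.99]  # fake measurement noise
volumes = [true_y0 * math.exp(true_rate * t) * w for t, w in zip(hours, wobble)]

y0, rate = fit_exponential(hours, volumes)
predicted_8h = y0 * math.exp(rate * 8)   # a condition that was never measured
doubling_time = math.log(2) / rate

print(round(rate, 3), round(doubling_time, 2), round(predicted_8h, 1))
```

Exponential growth only describes the early phase of a culture. Once nutrients run low, a logistic or ODE-based model is usually a better fit, and the gap between the two is itself something you can report.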

Pattern B: Public data trains a model, your measurements stress-test it. Train a classifier, forecaster, or structure predictor on a public dataset. Then collect a small original dataset at home (citizen-science observations, smartphone recordings, survey responses) and evaluate how the model generalizes. Failure modes are the interesting finding.
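The evaluation step in Pattern B can be sketched as a comparison of the same model on two datasets. Everything below is a toy assumption made for illustration: the "model" is a one-feature threshold rule, and the three small datasets are invented numbers standing in for a public training set, a public test set, and measurements collected at home.

```python
def fit_threshold(examples):
    """Toy 'training': put the cutoff midway between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    cutoff = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= cutoff else 0


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(1 for x, y in examples if model(x) == y) / len(examples)


def failures(model, examples):
    """List (input, true label, predicted label) for every mistake."""
    return [(x, y, model(x)) for x, y in examples if model(x) != y]


# Invented (feature, label) pairs, for illustration only.
public_train = [(0.9, 0), (1.1, 0), (1.0, 0), (3.0, 1), (3.2, 1), (2.8, 1)]
public_test = [(1.2, 0), (0.8, 0), (2.9, 1), (3.1, 1)]
home_data = [(1.8, 0), (2.3, 0), (2.6, 1), (3.4, 1), (1.9, 1)]

model = fit_threshold(public_train)
print("public test accuracy:", accuracy(model, public_test))
print("home data accuracy:", accuracy(model, home_data))
print("home failures:", failures(model, home_data))
```

The drop in accuracy between the public test set and the home data is the generalization gap, and the list of failures is the raw material for explaining why it happens.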

Judges respond to this hybrid shape because it shows you can both write code and touch the real system the code claims to describe.

Choosing a Phenomenon That Has Not Been Done

Originality is a process, not a guess. Three steps:

Google Scholar. Search your candidate question in quotes plus the method you plan to use. Read the first two pages of results. If someone has done your exact study, refine the angle (different population, different method, different metric).
Society for Science abstracts archive. Search the public ISEF and Regeneron STS abstract databases for your keywords. This shows you what other high school students have already pitched.
PubMed. Filter to review articles from the last five years on your topic. Reviews tell you what the open questions are, which is exactly where your project should sit.
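The PubMed step can also be scripted. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint, restricted to review articles from the last five years; it does not send the request. The function name and default values are assumptions made for this example, and the query should be checked against the PubMed website before you rely on it.

```python
from datetime import date
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_review_search_url(topic, years_back=5, max_results=20, today=None):
    """Build an esearch URL for recent review articles on a topic."""
    today = today or date.today()
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND review[pt]",  # [pt] = publication type
        "datetype": "pdat",                   # filter on publication date
        "mindate": str(today.year - years_back),
        "maxdate": str(today.year),
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = pubmed_review_search_url("kombucha microbiome", today=date(2024, 6, 1))
print(url)
```

Opening the printed URL in a browser returns a list of PubMed IDs in JSON. NCBI asks users to keep automated requests infrequent, so one query per search term is plenty for this step.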

Finding closely related prior work is good news. It means the question is real, the data exists, and you can position your project as the next adjacent step.

A Realistic Timeline

One to two weeks: Pick one public dataset and one tool. Reproduce a published analysis end-to-end and write up what you learned.
One to two months: Full hybrid project for a regional fair. One model, one original measurement or one new dataset slice, a clean figure set, and a poster.
Full year: ISEF-track project. Multiple methods compared, ablations, a held-out validation set, and a writeup that reads like a short paper.

If this is your first research project, start with the one-to-two-week version. Finishing a small project teaches you more than half-finishing a big one.

A Starter Checklist

A quiet workspace with your laptop, charger, and notebook in one spot.
A free Google account with Colab opened and a GPU runtime tested.
A local Python environment (Anaconda or Miniconda) with NumPy, pandas, scikit-learn, PyTorch, RDKit, and Biopython installed.
PyMOL or ChimeraX installed for any structure-related work.
A GitHub account with one empty repository named after your project.
A lab notebook (paper or digital) with the date on the first page.
A single sentence at the top of page one: your one-line research question.

If you have those seven things, you are ready to pick a phenomenon.

Where to Go Next

ISEF organizes computational biology into seven subcategories. Each has its own MehtA+ project guide that builds directly on the kit and tools above. Pick the one that pulls you in.

Computational Biomodeling (MOD) — simulating biological systems with agent-based, ODE, PDE, or stochastic models.
Computational Epidemiology (EPD) — forecasting and analyzing disease spread with public health, climate, and behavioral data.
Computational Evolutionary Biology (EVO) — phylogenetics, selection scans, and comparative genomics across species.
Computational Neuroscience (NEU) — modeling neurons, networks, and brain data from EEG, fMRI, and connectomes.
Computational Pharmacology (PHA) — drug discovery, docking, pharmacokinetics, and adverse-event analysis.
Genomics (GEN) — sequence analysis, variant interpretation, single-cell, and foundation models for DNA and RNA.
Other (OTH) — cross-cutting projects in federated learning, multi-omics integration, bioacoustics, and AI-for-bio tooling.

The same laptop, the same Colab tab, and the same public databases support every one of these. Computational biology research from your kitchen counter is not a workaround. It is the real thing.

Project ideas in this category (70)

ABCD fMRI Social Media and Anxiety Patterns

Computational Biology and Bioinformatics · Computational Neuroscience · Advanced

AI School Policy Search for Disease Spread

Computational Biology and Bioinformatics · Computational Epidemiology · Advanced

AlphaFold Protein Complex Modeling with Cross-Link Data

Computational Biology and Bioinformatics · Other · Advanced

Ancestral Antifreeze Protein Evolution

Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced

Asthma Polygenic Risk Across Ancestries

Computational Biology and Bioinformatics · Genomics · Advanced

Bayesian Models of Optical Illusions

Computational Biology and Bioinformatics · Computational Neuroscience · Advanced

Blood-Brain Barrier Prediction for Antipsychotic Leads

Computational Biology and Bioinformatics · Computational Pharmacology · Advanced

C. elegans Connectome Simulation Project

Computational Biology and Bioinformatics · Computational Neuroscience · Advanced

Causal Discovery in Sleep, Glycemia, and Mood

Computational Biology and Bioinformatics · Other · Advanced

Circadian Splicing in Human Tissues

Computational Biology and Bioinformatics · Genomics · Advanced

Comparative Genomics of Longevity Gene Loss

Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced

CRISPR Base Editing Outcome Prediction

Computational Biology and Bioinformatics · Genomics · Advanced

CRISPR Off-Target Binding With Gillespie Models

Computational Biology and Bioinformatics · Computational Biomodeling · Advanced

Decoding Imagined vs. Heard Speech in fMRI

Computational Biology and Bioinformatics · Computational Neuroscience · Advanced

Dengue Outbreak Forecasting With Public Data

Computational Biology and Bioinformatics · Computational Epidemiology · Advanced

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →