How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor.

Computational biology used to live inside university clusters with budgets most schools never see. Today, the same models, the same genomes, and the same simulation engines run on a free Google Colab tab from your bedroom.

This guide is your starting point. It covers the three things you need to begin: the small home kit that turns your laptop into a science workstation, the free software professional labs actually use, and the public datasets that count as real research data.

Why This Is Possible Now

Three shifts in the last decade changed what a high school student can do.

First, public biological data exploded. Most major genomes, protein structures, single-cell atlases, drug bioactivity tables, and GWAS summary statistics are now free to download. You can pull the same open-access TCGA tumor data, the same UK Biobank-derived summary statistics, and the same AlphaFold structures that a graduate student pulls.

Second, free GPU compute became normal. Google Colab and Kaggle offer free-tier access to a real NVIDIA GPU, typically for a few hours at a time and subject to availability and usage quotas. That is enough to fold a small protein with ColabFold, dock ligands with DiffDock, fine-tune a small transformer, or run a short molecular dynamics trajectory in OpenMM.

Third, the heavy software went open source. AlphaFold, ESM-2, RDKit, GROMACS, OpenMM, PyTorch, scikit-learn, NEURON, Brian2, Mesa, NetLogo, COBRApy, and DeepChem are all free to install. Twenty years ago, most of these did not exist, and comparable commercial tools often required a license fee or a lab affiliation.

Put together: a laptop on a kitchen counter plus a Colab tab now equals a working computational biology workstation.

The Computational Biology Home Kit

Most of your "kit" is software, but a small physical setup makes your project measurable and presentable.

Workstation basics

  • A laptop with at least 8 GB RAM (any modern Chromebook, Mac, or Windows machine works for Colab-first projects).
  • A free Google account for Colab and Drive storage.
  • A second monitor or tablet for reading papers while you code (optional but useful).

Optional wet-lab and citizen-science add-ons

  • Clip-on smartphone microscope lens (~$10) for pond-water or microorganism imaging.
  • Smartphone tripod and a windowsill timer setup for time-lapse photography of cultures.
  • Kitchen-safe culture materials: kombucha SCOBY, baker's yeast, sprouted seeds, garden soil.
  • Kitchen scale, balloons, and a metric tape measure for fermentation rate or CO₂ displacement measurements (~$15 total).
  • Citizen-science accounts on iNaturalist, eBird, FoldIt, and Eterna (free).
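
The balloon-and-tape-measure setup only becomes data once a circumference reading is converted to a gas volume. The sketch below shows one way to do that arithmetic. It assumes the balloon is roughly spherical, so that V = C³ / (6π²), and it ignores the balloon's internal pressure and any air trapped at the start. The function names are illustrative, not part of any library.

```python
import math


def balloon_volume_ml(circumference_cm: float) -> float:
    """Approximate gas volume in mL, treating the balloon as a sphere.

    Assumption: a real balloon is not a perfect sphere, so this is a
    rough estimate. 1 cubic centimeter equals 1 mL.
    """
    if circumference_cm < 0:
        raise ValueError("circumference must be non-negative")
    radius_cm = circumference_cm / (2 * math.pi)
    return (4.0 / 3.0) * math.pi * radius_cm ** 3


def production_rate_ml_per_hour(readings):
    """Average gas production rate between the first and last reading.

    readings: list of (hours_elapsed, circumference_cm) tuples in time order.
    """
    (t_start, c_start), (t_end, c_end) = readings[0], readings[-1]
    if t_end <= t_start:
        raise ValueError("readings must span a positive amount of time")
    gained_ml = balloon_volume_ml(c_end) - balloon_volume_ml(c_start)
    return gained_ml / (t_end - t_start)
```

Because volume grows with the cube of circumference, a small measuring error turns into a larger volume error, so it helps to take each reading two or three times and record all of them.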

Optional consumer biosensors

  • A consumer EEG headset such as Muse or OpenBCI Ganglion (under $200) for attention, sleep, or focus studies.
  • A smartphone heart-rate or pulse-ox app paired with a notebook for physiology pilot data.
  • A USB microphone for bioacoustic recording (mosquito wingbeats, bird calls, voice samples).

Lab notebook stack

  • A bound notebook or a Notion/Obsidian vault for daily entries.
  • A free GitHub account for version-controlling code and saving notebooks.

Total cost for everything above, if you buy the optional pieces: roughly $0 to $250, and most strong projects use less than $50 of physical hardware.

The Signature Technique: Running Professional Pipelines on a Free Colab GPU

Computational biology's signature move is taking a tool that used to need a server room and running it from a browser tab. Once you can do this, the rest of the field opens up. Here is the five-step workflow.

  1. Open a fresh Colab notebook at colab.research.google.com and switch the runtime to GPU under Runtime > Change runtime type.
  2. Mount Google Drive so your inputs, model weights, and outputs persist between sessions. One cell, two lines.
  3. Install the tool with pip inside the notebook (conda also works, but on Colab it needs a helper such as condacolab first). ColabFold, OpenMM, RDKit, DeepChem, AutoDock Vina, and Hugging Face transformers all install in a few minutes.
  4. Run on a small test input first. Fold one short protein. Dock one ligand. Simulate one short trajectory. Confirm the output makes sense before scaling up.
  5. Scale to your real dataset and save outputs to Drive. Log runtimes, parameters, and random seeds in your notebook so the run is reproducible.

This loop is how you fold a protein, run a molecular dynamics simulation, train a graph neural network, or build an epidemiological forecasting pipeline. The same five steps cover every dry-lab tool below.
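
Steps 2 and 5 are the ones most often skipped, so here is a minimal sketch of what they can look like in a notebook cell. It assumes the standard Colab Drive mount point (`/content/drive/MyDrive`) and falls back to a local `outputs` folder when run outside Colab. The helper names `get_output_dir`, `log_run`, and `timed_run` are made up for this example.

```python
import json
import random
import time
from pathlib import Path


def get_output_dir(project: str = "my_project") -> Path:
    """Return a folder on Google Drive in Colab, or a local folder elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
        drive.mount("/content/drive")
        base = Path("/content/drive/MyDrive")
    except ImportError:
        base = Path.cwd() / "outputs"
    out_dir = base / project
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def log_run(out_dir, params: dict, seed: int, runtime_s: float) -> Path:
    """Append one JSON line describing a run so it can be reproduced later."""
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "seed": seed,
        "runtime_s": round(runtime_s, 3),
        "params": params,
    }
    log_path = Path(out_dir) / "run_log.jsonl"
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return log_path


def timed_run(func, params: dict, seed: int, out_dir):
    """Seed Python's RNG, run func(**params), and log parameters and runtime.

    Note: this seeds only the standard-library RNG. NumPy and PyTorch keep
    their own random state and would need to be seeded separately.
    """
    random.seed(seed)
    start = time.perf_counter()
    result = func(**params)
    log_run(out_dir, params, seed, time.perf_counter() - start)
    return result
```

A typical cell would call `out_dir = get_output_dir("kombucha_model")` once, then wrap each experiment in `timed_run(...)` so every result has a matching line in `run_log.jsonl`.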

The Dry-Lab Side: Free Software You Can Install Today

Structure and structural biology

  • PyMOL and ChimeraX for viewing and annotating protein structures.
  • ColabFold for AlphaFold2 protein structure prediction in a browser.
  • ESMFold (built on the ESM-2 protein language model) for fast single-sequence structure prediction.
  • Boltz-1 for newer open-source structure prediction including complexes.

Docking and drug design

  • AutoDock Vina and Smina for classical ligand docking.
  • DiffDock for diffusion-model-based pose prediction.
  • RDKit for cheminformatics, molecular descriptors, and SMILES handling.
  • DeepChem for chemistry-focused machine learning pipelines.
  • REINVENT and fragment-based generators for de novo molecular design.

Molecular dynamics and biophysics

  • OpenMM for GPU-accelerated molecular dynamics on Colab.
  • GROMACS for full-feature classical simulations.
  • Martini, a coarse-grained force field usually run in GROMACS (an OpenMM port also exists), for membrane and nanoparticle simulations.

Systems biology and simulation

  • COBRApy for genome-scale metabolic flux balance analysis.
  • Tellurium and COPASI for ODE-based pathway modeling.
  • PhysiCell for multicellular tissue-scale simulations.
  • Mesa (Python) and NetLogo for agent-based models.
  • Gillespie SSA in Python for stochastic biochemical simulations.
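
To make the last item concrete, here is a minimal sketch of the Gillespie direct method for the simplest possible system: a molecule produced at a constant rate and degraded in proportion to its copy number. The reaction scheme, rate names, and function name are chosen for illustration and use only the Python standard library.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * n).

    Returns two lists: event times and the molecule count after each event.
    """
    rng = random.Random(seed)  # private RNG so runs are reproducible
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_prod = k_prod        # propensity of the production reaction
        a_deg = k_deg * n      # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total <= 0:
            break              # nothing can happen any more
        # Waiting time to the next event is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fired, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts
```

For this system the long-run average count should hover around `k_prod / k_deg`, which gives you a quick sanity check before you swap in a more interesting reaction network.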

Neuroscience modeling

  • NEURON and Brian2 for biophysical and spiking neural network models.

Machine learning core stack

  • scikit-learn for classical ML and statistics.
  • PyTorch and JAX for deep learning.
  • PyTorch Geometric for graph neural networks on molecules and biological networks.
  • Hugging Face for transformers, foundation models, and pretrained checkpoints.

Bioinformatics pipelines

  • Snakemake and Nextflow (with nf-core templates) for reproducible workflows.
  • Biopython and scanpy for sequence and single-cell data wrangling.
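
If "sequence wrangling" sounds abstract, the sketch below shows the kind of task it covers: reading FASTA-formatted text and computing GC content per record. It uses only the standard library so it runs anywhere. For real files, Biopython's `SeqIO.parse` is the more robust choice, since it handles edge cases this toy parser ignores. The function names here are illustrative.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {record_id: sequence}.

    The record id is the first word after '>'. Sequence lines are
    uppercased and joined. Lines before the first header are ignored.
    """
    chunks = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a generated id if the header is empty.
            current = words[0] if words else f"record_{len(chunks) + 1}"
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

Calling `gc_content` on every record from `parse_fasta` gives you a first table of numbers to plot, which is often the fastest way to confirm a download worked.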

Running these on your own machine, not just reading about them, changes how research feels. You stop being only a student of biology and start being a practitioner of it.

Public Databases That Count as Real Data

Sequence and gene annotation

  • NCBI for GenBank, RefSeq, SRA, and almost every public sequence ever deposited.
  • Ensembl for genome browsers and comparative genomics.
  • UniProt for protein sequences and functional annotation.
  • InterPro (which now hosts the Pfam families) for protein domains and families.
  • KEGG for pathways and metabolic networks.

Expression and single-cell atlases

  • GTEx for human tissue gene expression.
  • GEO and Expression Atlas for thousands of public expression studies.
  • Human Cell Atlas and Tabula Sapiens for single-cell reference maps.
  • ENCODE and Roadmap Epigenomics for regulatory element tracks.

Variation and population genetics

  • 1000 Genomes for global human variation.
  • gnomAD for allele frequencies across populations.
  • Public GWAS summary statistics derived from UK Biobank for phenotype-genotype links (individual-level UK Biobank data requires an approved application).

Cancer and disease genomics

  • TCGA (via cBioPortal and GDC) for multi-omic tumor data.

Structures and drug-target data

  • PDB for experimentally determined protein structures.
  • AlphaFold DB for predicted structures of nearly every known protein.
  • ChEMBL, DrugBank, PubChem, and BindingDB for compound bioactivity.
  • ZINC for purchasable molecular libraries.

Clinical and pharmacology

  • FAERS for adverse-event reports.
  • Stanford HIVdb for resistance mutations.
  • MIMIC-IV for de-identified ICU data (free, but access requires PhysioNet credentialing: a human-subjects research training course and a signed data use agreement).

Epidemiology and public health

  • WHO, CDC WONDER, and CDC NWSS for disease and wastewater surveillance.
  • OWID COVID and Johns Hopkins CSSE archives.
  • HealthData.gov for U.S. open health datasets.
  • Google Trends and Google Community Mobility archives for behavioral signal.

Environmental and ecological

  • ERA5 climate reanalysis, NASA land-surface temperature, NLCD land cover.
  • iNaturalist and eBird for citizen-science species occurrence.
  • OpenStreetMap for geographic context.

Re-analyzing a public dataset with a new method is itself a legitimate research path. Some of the strongest ISEF computational projects never generate a single new data point.

How to Combine Wet and Dry: The Strongest Project Shape

Pattern A: Household measurement calibrates a simulation. Grow something simple at home (a SCOBY, sprouted seeds, yeast fermenting in a balloon, pond-water microbes on a smartphone microscope). Measure it over time. Then fit a model (agent-based, ODE, flux balance, or reaction-diffusion) to your measurements and use the model to predict conditions you did not test.
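
As a sketch of what "fit a model to your measurements" can mean in Pattern A, the code below fits a logistic growth curve to a time series by brute-force grid search. It makes several assumptions: the first measurement is taken at time 0 and is positive, growth is well described by a logistic curve, and a coarse grid is accurate enough for a first pass. With SciPy available you would more likely use `scipy.optimize.curve_fit`. The function names are illustrative.

```python
import math


def logistic(t, n0, r, k):
    """Logistic growth: size at time t given start n0, rate r, capacity k."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def fit_logistic(times, values, r_grid, k_grid):
    """Find the (r, k) pair on the grid with the smallest squared error.

    Assumes times[0] == 0 so that values[0] can serve as n0.
    """
    n0 = values[0]
    best = None
    for r in r_grid:
        for k in k_grid:
            sse = sum(
                (logistic(t, n0, r, k) - y) ** 2
                for t, y in zip(times, values)
            )
            if best is None or sse < best[0]:
                best = (sse, r, k)
    return {"n0": n0, "r": best[1], "k": best[2], "sse": best[0]}


def predict(fit, t):
    """Use a fitted model to estimate the size at an untested time t."""
    return logistic(t, fit["n0"], fit["r"], fit["k"])
```

Once `fit_logistic` returns, `predict(fit, t)` for a day you did not measure is the model's claim, and measuring that day afterwards is the test of it.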

Pattern B: Public data trains a model, your measurements stress-test it. Train a classifier, forecaster, or structure predictor on a public dataset. Then collect a small original dataset at home (citizen-science observations, smartphone recordings, survey responses) and evaluate how the model generalizes. Failure modes are the interesting finding.
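
Pattern B comes down to three numbers: accuracy on the public data, accuracy on your own data, and the list of cases where the model failed. The sketch below shows that bookkeeping with a deliberately tiny one-feature classifier. The classifier, the function names, and the numbers in the usage example are all placeholders invented for illustration. In a real project the model would come from scikit-learn or PyTorch.

```python
from statistics import mean


def train_midpoint_classifier(samples):
    """Train on (feature, label) pairs with labels 0 or 1.

    The decision threshold is the midpoint between the two class means.
    Returns a predict(x) function.
    """
    mean0 = mean(x for x, y in samples if y == 0)
    mean1 = mean(x for x, y in samples if y == 1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 >= mean0 else 0

    def predict(x):
        return high_label if x >= threshold else 1 - high_label

    return predict


def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


def failures(predict, samples):
    """The samples the model got wrong: the raw material for your analysis."""
    return [(x, y) for x, y in samples if predict(x) != y]


# Made-up placeholder numbers, not real measurements.
public_data = [(1.0, 0), (1.2, 0), (0.8, 0), (3.0, 1), (3.2, 1), (2.8, 1)]
home_data = [(1.1, 0), (2.3, 0), (2.6, 1), (1.9, 1)]

model = train_midpoint_classifier(public_data)
gap = accuracy(model, public_data) - accuracy(model, home_data)
print(f"generalization gap: {gap:.2f}")
print("home-data failures:", failures(model, home_data))
```

The gap tells you how much performance is lost on your own data, and the failure list tells you where to look for the reason.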

Judges respond to this hybrid shape because it shows you can both write code and touch the real system the code claims to describe.

Choosing a Phenomenon That Has Not Been Done

Originality is a process, not a guess. Three steps:

  1. Google Scholar. Search your candidate question in quotes plus the method you plan to use. Read the first two pages of results. If someone has done your exact study, refine the angle (different population, different method, different metric).
  2. Society for Science abstracts archive. Search the public ISEF and Regeneron STS abstract databases for your keywords. This shows you what other high school students have already pitched.
  3. PubMed. Filter to review articles from the last five years on your topic. Reviews tell you what the open questions are, which is exactly where your project should sit.

Finding closely related prior work is good news. It means the question is real, the data exists, and you can position your project as the next adjacent step.

A Realistic Timeline

  • One to two weeks: Pick one public dataset and one tool. Reproduce a published analysis end-to-end and write up what you learned.
  • One to two months: Full hybrid project for a regional fair. One model, one original measurement or one new dataset slice, a clean figure set, and a poster.
  • Full year: ISEF-track project. Multiple methods compared, ablations, a held-out validation set, and a writeup that reads like a short paper.

If this is your first research project, start with the one-to-two-week version. Finishing a small project teaches you more than half-finishing a big one.

A Starter Checklist

  1. A quiet workspace with your laptop, charger, and notebook in one spot.
  2. A free Google account with Colab opened and a GPU runtime tested.
  3. A local Python environment (Anaconda or Miniconda) with NumPy, pandas, scikit-learn, PyTorch, RDKit, and Biopython installed.
  4. PyMOL or ChimeraX installed for any structure-related work.
  5. A GitHub account with one empty repository named after your project.
  6. A lab notebook (paper or digital) with the date on the first page.
  7. A single sentence at the top of page one: your one-line research question.

If you have those seven things, you are ready to pick a phenomenon.

Where to Go Next

ISEF's Computational Biology and Bioinformatics category has seven subcategories. Each has its own MehtA+ project guide that builds directly on the kit and tools above. Pick the one that pulls you in.

  • Computational Biomodeling (MOD) — simulating biological systems with agent-based, ODE, PDE, or stochastic models.
  • Computational Epidemiology (EPD) — forecasting and analyzing disease spread with public health, climate, and behavioral data.
  • Computational Evolutionary Biology (EVO) — phylogenetics, selection scans, and comparative genomics across species.
  • Computational Neuroscience (NEU) — modeling neurons, networks, and brain data from EEG, fMRI, and connectomes.
  • Computational Pharmacology (PHA) — drug discovery, docking, pharmacokinetics, and adverse-event analysis.
  • Genomics (GEN) — sequence analysis, variant interpretation, single-cell, and foundation models for DNA and RNA.
  • Other (OTH) — cross-cutting projects in federated learning, multi-omics integration, bioacoustics, and AI-for-bio tooling.

The same laptop, the same Colab tab, and the same public databases support every one of these. Computational biology research from your kitchen counter is not a workaround. It is the real thing.

Project ideas in this category (70)

ABCD fMRI Social Media and Anxiety Patterns

Computational Neuroscience · Advanced

AI School Policy Search for Disease Spread

Computational Epidemiology · Advanced

AlphaFold Protein Complex Modeling with Cross-Link Data

Other · Advanced

Ancestral Antifreeze Protein Evolution

Computational Evolutionary Biology · Advanced

Asthma Polygenic Risk Across Ancestries

Genomics · Advanced

Bayesian Models of Optical Illusions

Computational Neuroscience · Advanced

Blood-Brain Barrier Prediction for Antipsychotic Leads

Computational Pharmacology · Advanced

C. elegans Connectome Simulation Project

Computational Neuroscience · Advanced

Causal Discovery in Sleep, Glycemia, and Mood

Other · Advanced

Circadian Splicing in Human Tissues

Genomics · Advanced

Comparative Genomics of Longevity Gene Loss

Computational Evolutionary Biology · Advanced

CRISPR Base Editing Outcome Prediction

Genomics · Advanced

CRISPR Off-Target Binding With Gillespie Models

Computational Biomodeling · Advanced

Decoding Imagined vs. Heard Speech in fMRI

Computational Neuroscience · Advanced

Dengue Outbreak Forecasting With Public Data

Computational Epidemiology · Advanced

Detecting Human Positive Selection in Genome Data

Computational Evolutionary Biology · Advanced

DiffDock Drug Docking for Antimalarial Targets

Computational Pharmacology · Advanced

Dopamine Reward Signals in Reinforcement Learning

Computational Neuroscience · Intermediate

Early Disease Signal Detection With NLP

Computational Epidemiology · Advanced

Fairness in Histopathology AI Models

Other · Advanced

Federated Sepsis Prediction With Privacy Analysis

Other · Advanced

Finding Micropeptides in lncRNAs

Genomics · Advanced

Flagging Polypharmacy Risks in Adverse Event Reports

Computational Pharmacology · Advanced

Generative PROTAC Linkers for KRAS-G12D

Computational Pharmacology · Advanced

Gut Microbiome Fiber Response Modeling

Computational Biomodeling · Advanced

Heat-Visit Prediction With Satellite Data

Computational Epidemiology · Advanced

HIV Resistance Mutation Prediction

Computational Pharmacology · Advanced

Household Dust Metagenomics and Allergy Links

Genomics · Advanced

Immune Cell Infiltration in Tumor Spheroids

Computational Biomodeling · Advanced

Language Models for Cytochrome C Evolution

Computational Evolutionary Biology · Advanced

LLM Bioinformatics Pipeline Generation

Other · Advanced

Martini MD for Nanoparticle Membrane Wrapping

Computational Pharmacology · Advanced

Measles County Simulation and Tipping Points

Computational Epidemiology · Advanced

Menstrual Hormone ODE Modeling for PCOS Sampling

Computational Biomodeling · Advanced

Migraine Cortical Spreading Depression Model

Computational Neuroscience · Advanced

Modeling Antibiotic Norms and AMR Spread

Computational Evolutionary Biology · Advanced

Modeling SCOBY Biofilm Growth in Kombucha

Computational Biomodeling · Intermediate

Modeling Vaccine Hesitancy on Social Media

Computational Epidemiology · Advanced

Multi-Omics VAE for Metabolic Stress

Other · Advanced

Nanopore Plasmid Mapping in Wastewater

Genomics · Advanced

Network Pharmacology for Diabetes Herb Formulas

Computational Pharmacology · Advanced

PBPK Modeling of Metformin Dosing in Kids vs Adults

Computational Pharmacology · Advanced

Phage-Host Coevolution Models

Computational Evolutionary Biology · Advanced

PhysiCell Fibroblast Migration for Wound Healing

Computational Biomodeling · Advanced

Plant Circadian Clock Modeling for Climate Shifts

Computational Biomodeling · Advanced

Plant-Pollinator Coevolution Gene Analysis

Computational Evolutionary Biology · Advanced

Pollen Tube Growth Physics Simulation Project

Computational Biomodeling · Advanced

Predicting Autism Noncoding Variant Effects

Genomics · Advanced

Predicting Lyme Disease Spread With Climate Data

Computational Epidemiology · Advanced

Predicting Promoter Strength in Crops

Genomics · Advanced

Privacy-Preserving DNA Kinship Search

Other · Advanced

QSAR Screening for Herb-Drug Interactions

Computational Pharmacology · Advanced

Rice Pan-Genome Variant Analysis

Genomics · Advanced

RNN Sleep Stage Modeling for Shift Work

Computational Neuroscience · Advanced

SARS-CoV-2 Spike Evolution in Animal Reservoirs

Computational Evolutionary Biology · Advanced

School Air Quality and Absenteeism Analysis

Computational Epidemiology · Advanced

SEIR Modeling of UTI Spread in Water Networks

Computational Epidemiology · Advanced

Single-Cell RNA-Seq Senescence State Discovery

Genomics · Advanced

Smartphone EEG for Attention Drift Prediction

Computational Neuroscience · Advanced

Smartphone Mosquito Wingbeat Detection

Other · Advanced

Smartphone Pond Microbe Biodiversity Monitoring

Other · Intermediate

Smartphone Voice Models for Early Neuro Markers

Computational Neuroscience · Advanced

SMILES and Bioassay Search Models

Other · Advanced

Spotted Lanternfly Risk Modeling for 2050

Computational Evolutionary Biology · Advanced

Tinnitus Spiking Network Model

Computational Neuroscience · Advanced

Urban Animal Genetics and City Heat Islands

Computational Evolutionary Biology · Intermediate

Warfarin Dose Prediction With Explainable ML

Computational Pharmacology · Advanced

Wastewater and Absence Data for ER Surge Prediction

Computational Epidemiology · Advanced

Yeast Metabolism Under Household Stressors

Computational Biomodeling · Intermediate

Zebrafish Stripe Simulation and Pattern Fitting

Computational Biomodeling · Advanced
