How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor.
Computational biology used to live inside university clusters with budgets most schools never see. Today, the same models, the same genomes, and the same simulation engines run on a free Google Colab tab from your bedroom.
This guide is your starting point. It covers the three things you need to begin: the small home kit that turns your laptop into a science workstation, the free software professional labs actually use, and the public datasets that count as real research data.
Why This Is Possible Now
Three shifts in the last decade changed what a high school student can do.
First, public biological data exploded. Most major genomes, protein structures, single-cell atlases, drug bioactivity tables, and GWAS summary statistics are now downloadable for free. You can pull the same open-access TCGA tumor data, the same UK Biobank summary stats, and the same AlphaFold structures that a graduate student pulls.
Second, free GPU compute became normal. Google Colab and Kaggle offer free NVIDIA GPU sessions, subject to usage limits and availability. That is enough to fold small proteins with ColabFold, dock ligands with DiffDock, fine-tune a small transformer, or run a short molecular dynamics trajectory in OpenMM.
Third, the heavy software went open source. AlphaFold, ESM2, RDKit, GROMACS, OpenMM, PyTorch, scikit-learn, NEURON, Brian2, Mesa, NetLogo, COBRApy, and DeepChem are all free to install. Twenty years ago, many of these tools did not exist, and comparable commercial software often required a license fee or a lab affiliation.
Put together: a laptop on a kitchen counter plus a Colab tab now equals a working computational biology workstation.
The Computational Biology Home Kit
Most of your "kit" is software, but a small physical setup makes your project measurable and presentable.
Workstation basics
- A laptop with at least 8 GB RAM (any modern Chromebook, Mac, or Windows machine works for Colab-first projects).
- A free Google account for Colab and Drive storage.
- A second monitor or tablet for reading papers while you code (optional but useful).
Optional wet-lab and citizen-science add-ons
- Clip-on smartphone microscope lens (~$10) for pond-water or microorganism imaging.
- Smartphone tripod and a windowsill timer setup for time-lapse photography of cultures.
- Kitchen-safe culture materials: kombucha SCOBY, baker's yeast, sprouted seeds, garden soil.
- Kitchen scale, balloons, and a metric tape measure for fermentation rate or CO₂ displacement measurements (~$15 total).
- Citizen-science accounts on iNaturalist, eBird, FoldIt, and Eterna (free).
Optional consumer biosensors
- A consumer EEG headset such as Muse or OpenBCI Ganglion (under $200) for attention, sleep, or focus studies.
- A smartphone heart-rate or pulse-ox app paired with a notebook for physiology pilot data.
- A USB microphone for bioacoustic recording (mosquito wingbeats, bird calls, voice samples).
Lab notebook stack
- A bound notebook or a Notion/Obsidian vault for daily entries.
- A free GitHub account for version-controlling code and saving notebooks.
Total cost for everything above, if you buy the optional pieces: roughly $0 to $250, and most strong projects use less than $50 of physical hardware.
The Signature Technique: Running Professional Pipelines on a Free Colab GPU
Computational biology's signature move is taking a tool that used to need a server room and running it from a browser tab. Once you can do this, the rest of the field opens up. Here is the five-step workflow.
- Open a fresh Colab notebook at colab.research.google.com and switch the runtime to GPU under Runtime, Change runtime type.
- Mount Google Drive so your inputs, model weights, and outputs persist between sessions. One cell, two lines.
- Install the tool with pip inside the notebook (conda also works, but on Colab it needs an extra setup step such as condacolab). ColabFold, OpenMM, RDKit, DeepChem, AutoDock Vina, and Hugging Face transformers all install in a few minutes.
- Run on a small test input first. Fold one short protein. Dock one ligand. Simulate one short trajectory. Confirm the output makes sense before scaling up.
- Scale to your real dataset and save outputs to Drive. Log runtimes, parameters, and random seeds in your notebook so the run is reproducible.
This loop is how you fold a protein, run a molecular dynamics simulation, train a graph neural network, or build an epidemiological forecasting pipeline. The same five steps cover every dry-lab tool below.
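The five steps above can be sketched as a small notebook skeleton. This is a minimal sketch, not a definitive pipeline: `small_test_job` is a hypothetical placeholder for whatever tool you actually run, the output folder name `compbio_runs` is an assumption, and the GPU check only looks for the `nvidia-smi` command rather than querying any framework.

```python
import json
import random
import shutil
import tempfile
import time
from pathlib import Path


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def gpu_visible():
    """Rough check: is the nvidia-smi command on the PATH?"""
    return shutil.which("nvidia-smi") is not None


def small_test_job(seed, n_samples):
    """Hypothetical stand-in for a real tool call (fold one protein, dock one ligand)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n_samples)) / n_samples


def log_run(out_dir, params, seed, runtime_s):
    """Write one JSON record per run so the run can be repeated later."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "params": params,
        "seed": seed,
        "runtime_s": round(runtime_s, 3),
        "gpu_visible": gpu_visible(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    path = out_dir / f"run_seed{seed}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


# Steps 2, 4, and 5: persist outputs, run a small test, log what you did.
mounted = mount_drive_if_colab()
out_dir = "/content/drive/MyDrive/compbio_runs" if mounted else tempfile.mkdtemp()
params = {"n_samples": 1000}
seed = 42
start = time.perf_counter()
result = small_test_job(seed, **params)
log_path = log_run(out_dir, params, seed, time.perf_counter() - start)
print("result:", result, "| logged to", log_path)
```

Outside Colab the Drive mount is skipped and the log lands in a temporary folder, so the same cell can be tried on a laptop before moving to a GPU runtime.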
The Dry-Lab Side: Free Software You Can Install Today
Structure and structural biology
- PyMOL and ChimeraX for viewing and annotating protein structures.
- ColabFold for AlphaFold2 protein structure prediction in a browser.
- ESMFold (via the ESM library or Hugging Face transformers) for fast single-sequence structure prediction.
- Boltz-1 for newer open-source structure prediction including complexes.
Docking and drug design
- AutoDock Vina and Smina for classical ligand docking.
- DiffDock for diffusion-model-based pose prediction.
- RDKit for cheminformatics, molecular descriptors, and SMILES handling.
- DeepChem for chemistry-focused machine learning pipelines.
- REINVENT and fragment-based generators for de novo molecular design.
Molecular dynamics and biophysics
- OpenMM for GPU-accelerated molecular dynamics on Colab.
- GROMACS for full-feature classical simulations.
- Martini (native to GROMACS, with community ports for OpenMM) for coarse-grained membrane and nanoparticle simulations.
Systems biology and simulation
- COBRApy for genome-scale metabolic flux balance analysis.
- Tellurium and COPASI for ODE-based pathway modeling.
- PhysiCell for multicellular tissue-scale simulations.
- Mesa (Python) and NetLogo for agent-based models.
- Gillespie SSA in Python for stochastic biochemical simulations.
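To make the last item concrete, here is a minimal sketch of the Gillespie stochastic simulation algorithm for the simplest possible system: one molecule type that is produced at a constant rate and degraded in proportion to its count. The function name, rate names, and parameter values are illustrative assumptions, not taken from any specific package.

```python
import random


def gillespie_birth_death(k_make, k_decay, n0, t_end, seed=0):
    """Simulate production (rate k_make) and decay (rate k_decay * n) of one species."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        rates = [k_make, k_decay * n]      # propensity of each reaction
        total = sum(rates)
        if total == 0:
            break                          # nothing can happen any more
        t += rng.expovariate(total)        # waiting time to the next event
        if t > t_end:
            break
        if rng.random() * total < rates[0]:
            n += 1                         # production event
        else:
            n -= 1                         # decay event
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(k_make=10.0, k_decay=0.1, n0=0, t_end=200.0, seed=1)
print("events:", len(times) - 1, "| final count:", counts[-1])
```

Running the same call with several seeds and overlaying the trajectories shows the run-to-run noise that a deterministic ODE model of the same system would hide.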
Neuroscience modeling
- NEURON and Brian2 for biophysical and spiking neural network models.
Machine learning core stack
- scikit-learn for classical ML and statistics.
- PyTorch and JAX for deep learning.
- PyTorch Geometric for graph neural networks on molecules and biological networks.
- Hugging Face for transformers, foundation models, and pretrained checkpoints.
Bioinformatics pipelines
- Snakemake and Nextflow (with nf-core templates) for reproducible workflows.
- Biopython and scanpy for sequence and single-cell data wrangling.
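As a feel for what sequence wrangling means in practice, the sketch below parses FASTA-formatted text and computes GC content using only the standard library. It is a stand-in written for illustration: in a real project you would more likely read files with Biopython's `SeqIO.parse`, and the two sequences here are made-up examples.

```python
def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header as the record name.
            name = parts[0] if parts else f"record{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in records.items()}


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up example input; replace with text read from a downloaded FASTA file.
EXAMPLE = """>seq1 hypothetical example
ATGCGC
GGTA
>seq2
ATATAT
"""

for name, seq in parse_fasta(EXAMPLE).items():
    print(name, len(seq), round(gc_content(seq), 2))
```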
Running these on your own machine, not just reading about them, changes how research feels. You stop being only a student of biology and start practicing it.
Public Databases That Count as Real Data
Sequence and gene annotation
- NCBI for GenBank, RefSeq, SRA, and almost every public sequence ever deposited.
- Ensembl for genome browsers and comparative genomics.
- UniProt for protein sequences and functional annotation.
- Pfam and InterPro for protein domains and families.
- KEGG for pathways and metabolic networks.
Expression and single-cell atlases
- GTEx for human tissue gene expression.
- GEO and Expression Atlas for thousands of public expression studies.
- Human Cell Atlas and Tabula Sapiens for single-cell reference maps.
- ENCODE and Roadmap Epigenomics for regulatory element tracks.
Variation and population genetics
- 1000 Genomes for global human variation.
- gnomAD for allele frequencies across populations.
- UK Biobank summary statistics for GWAS-scale phenotype-genotype links.
Cancer and disease genomics
- TCGA (via cBioPortal and GDC) for multi-omic tumor data.
Structures and drug-target data
- PDB for experimentally determined protein structures.
- AlphaFold DB for predicted structures of nearly every known protein.
- ChEMBL, DrugBank, PubChem, and BindingDB for compound bioactivity.
- ZINC for purchasable molecular libraries.
Clinical and pharmacology
- FAERS for adverse-event reports.
- Stanford HIVdb for resistance mutations.
- MIMIC-IV for de-identified ICU data (with a free credentialing step).
Epidemiology and public health
- WHO, CDC WONDER, and CDC NWSS for disease and wastewater surveillance.
- Our World in Data (OWID) COVID-19 and Johns Hopkins CSSE archives for COVID-19 time series.
- HealthData.gov for U.S. open health datasets.
- Google Trends and Google Community Mobility archives for behavioral signal.
Environmental and ecological
- ERA5 climate reanalysis, NASA land-surface temperature, and NLCD land cover for environmental covariates.
- iNaturalist and eBird for citizen-science species occurrence.
- OpenStreetMap for geographic context.
Re-analyzing a public dataset with a new method is itself a legitimate research path. Some of the strongest ISEF computational projects never generate a single new data point.
How to Combine Wet and Dry: The Strongest Project Shape
Pattern A: Household measurement calibrates a simulation. Grow something simple at home (a SCOBY, sprouted seeds, yeast fermenting in a balloon, pond-water microbes on a smartphone microscope). Measure it over time. Then fit a model (agent-based, ODE, flux balance, or reaction-diffusion) to your measurements and use the model to predict conditions you did not test.
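A minimal sketch of the fit-then-predict step in Pattern A, assuming logistic growth and a simple grid search over two parameters. The "measurements" are synthetic values generated from known parameters so you can check that the fit recovers them; the parameter ranges and units are assumptions to replace with your own data. In practice a library routine such as `scipy.optimize.curve_fit` would usually replace the grid search.

```python
import math


def logistic(t, n0, r, k):
    """Logistic growth: starts at n0, grows at rate r, saturates at capacity k."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def fit_logistic(times, values):
    """Grid-search r and k that minimize squared error; n0 is the first reading."""
    n0 = values[0]
    best = None
    for i in range(2, 41):                 # r from 0.10 to 2.00 in steps of 0.05
        r = i / 20
        for k in range(100, 1001, 10):     # k from 100 to 1000 in steps of 10
            sse = sum((logistic(t, n0, r, k) - v) ** 2 for t, v in zip(times, values))
            if best is None or sse < best[0]:
                best = (sse, r, k)
    return best[1], best[2]


# Synthetic stand-in for hourly readings (for example, mL of CO2 in a balloon).
times = list(range(13))
values = [round(logistic(t, 10.0, 0.8, 500), 1) for t in times]

r_hat, k_hat = fit_logistic(times, values)
print("fitted r:", r_hat, "| fitted k:", k_hat)

# Use the fitted model to predict a time point that was never measured.
print("predicted value at t = 20:", round(logistic(20, values[0], r_hat, k_hat), 1))
```

Recovering known parameters from synthetic data first is a useful sanity check before trusting the same code on noisy kitchen measurements.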
Pattern B: Public data trains a model, your measurements stress-test it. Train a classifier, forecaster, or structure predictor on a public dataset. Then collect a small original dataset at home (citizen-science observations, smartphone recordings, survey responses) and evaluate how the model generalizes. Failure modes are the interesting finding.
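The stress test in Pattern B comes down to comparing accuracy on held-out public data with accuracy on your own measurements. The sketch below uses a nearest-centroid classifier and tiny made-up two-feature datasets purely for illustration; the class labels, feature values, and the offset in the "home" data are all hypothetical.

```python
def centroids(samples):
    """Average the feature vectors of each class. samples: [((x, y), label), ...]"""
    sums = {}
    for (x, y), label in samples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    x, y = point
    return min(model, key=lambda lab: (x - model[lab][0]) ** 2 + (y - model[lab][1]) ** 2)


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, point) == label for point, label in samples)
    return hits / len(samples)


# Hypothetical "public" training data and a held-out slice from the same source.
public_train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.2), "A"),
                ((3.0, 3.0), "B"), ((3.2, 2.8), "B"), ((2.8, 3.2), "B")]
public_test = [((1.1, 0.9), "A"), ((2.9, 3.1), "B"), ((0.9, 1.3), "A"), ((3.3, 2.7), "B")]

# Hypothetical home measurements: same classes, but the sensor reads systematically high.
home_test = [((2.3, 2.1), "A"), ((4.1, 4.3), "B"), ((1.9, 1.8), "A"), ((4.4, 3.9), "B")]

model = centroids(public_train)
print("held-out public accuracy:", accuracy(model, public_test))
print("home-data accuracy:", accuracy(model, home_test))
```

The gap between the two numbers, and the specific samples that flip, is the material for the failure-mode analysis.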
Judges respond to this hybrid shape because it shows you can both write code and touch the real system the code claims to describe.
Choosing a Phenomenon That Has Not Been Done
Originality is a process, not a guess. Three steps:
- Google Scholar. Search your candidate question in quotes plus the method you plan to use. Read the first two pages of results. If someone has done your exact study, refine the angle (different population, different method, different metric).
- Society for Science abstracts archive. Search the public ISEF and Regeneron STS abstract databases for your keywords. This shows you what other high school students have already pitched.
- PubMed. Filter to review articles from the last five years on your topic. Reviews tell you what the open questions are, which is exactly where your project should sit.
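The PubMed step can also be scripted. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint, restricted to review articles from the last five years; it does not fetch anything. The function name and the example topic are assumptions, and you should check NCBI's usage guidelines before sending many requests.

```python
from datetime import date
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_review_query(topic, years_back=5, today=None, max_results=20):
    """Build an esearch URL for review articles on a topic from recent years."""
    today = today or date.today()
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND review[pt]",   # [pt] = publication type filter
        "datetype": "pdat",                    # filter on publication date
        "mindate": str(today.year - years_back),
        "maxdate": str(today.year),
        "retmax": max_results,
        "retmode": "json",
    }
    return EUTILS + "?" + urlencode(params)


# Example topic is illustrative; paste the printed URL into a browser to see the hits.
print(pubmed_review_query("kombucha biofilm model"))
```

The JSON that comes back lists PubMed IDs, which you can then look up on the PubMed website or with the companion `esummary` endpoint.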
Finding closely related prior work is good news. It means the question is real, the data exists, and you can position your project as the next adjacent step.
A Realistic Timeline
- One to two weeks: Pick one public dataset and one tool. Reproduce a published analysis end-to-end and write up what you learned.
- One to two months: Full hybrid project for a regional fair. One model, one original measurement or one new dataset slice, a clean figure set, and a poster.
- Full year: ISEF-track project. Multiple methods compared, ablations, a held-out validation set, and a writeup that reads like a short paper.
If this is your first research project, start with the one-to-two-week version. Finishing a small project teaches you more than half-finishing a big one.
A Starter Checklist
- A quiet workspace with your laptop, charger, and notebook in one spot.
- A free Google account with Colab opened and a GPU runtime tested.
- A local Python environment (Anaconda or Miniconda) with NumPy, pandas, scikit-learn, PyTorch, RDKit, and Biopython installed.
- PyMOL or ChimeraX installed for any structure-related work.
- A GitHub account with one empty repository named after your project.
- A lab notebook (paper or digital) with the date on the first page.
- A single sentence at the top of page one: your one-line research question.
If you have those seven things, you are ready to pick a phenomenon.
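One quick way to confirm the local Python environment from the checklist is to ask Python which packages it can find. This is a small sketch, assuming the usual import names for each package (for example `sklearn` for scikit-learn and `Bio` for Biopython); it checks that a package is present, not that it works correctly.

```python
import importlib.util

# Display name -> import name (assumed to be the standard ones).
PACKAGES = {
    "NumPy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "PyTorch": "torch",
    "RDKit": "rdkit",
    "Biopython": "Bio",
}


def check_environment(packages=PACKAGES):
    """Return {display_name: True/False} for whether each package can be found."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in packages.items()
    }


for label, found in check_environment().items():
    print(f"{label:12s} {'ok' if found else 'MISSING'}")
```

Anything marked MISSING can be installed into the same environment before you start.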
Where to Go Next
ISEF organizes computational biology into seven subcategories. Each has its own MehtA+ project guide that builds directly on the kit and tools above. Pick the one that pulls you in.
- Computational Biomodeling (MOD) — simulating biological systems with agent-based, ODE, PDE, or stochastic models.
- Computational Epidemiology (EPD) — forecasting and analyzing disease spread with public health, climate, and behavioral data.
- Computational Evolutionary Biology (EVO) — phylogenetics, selection scans, and comparative genomics across species.
- Computational Neuroscience (NEU) — modeling neurons, networks, and brain data from EEG, fMRI, and connectomes.
- Computational Pharmacology (PHA) — drug discovery, docking, pharmacokinetics, and adverse-event analysis.
- Genomics (GEN) — sequence analysis, variant interpretation, single-cell, and foundation models for DNA and RNA.
- Other (OTH) — cross-cutting projects in federated learning, multi-omics integration, bioacoustics, and AI-for-bio tooling.
The same laptop, the same Colab tab, and the same public databases support every one of these. Computational biology research from your kitchen counter is not a workaround. It is the real thing.
Project ideas in this category (70)
Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
AI School Policy Search for Disease Spread · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
AlphaFold Protein Complex Modeling with Cross-Link Data · Computational Biology and Bioinformatics · Other · Advanced
Ancestral Antifreeze Protein Evolution · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
Asthma Polygenic Risk Across Ancestries · Computational Biology and Bioinformatics · Genomics · Advanced
Bayesian Models of Optical Illusions · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Blood-Brain Barrier Prediction for Antipsychotic Leads · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
C. elegans Connectome Simulation Project · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Causal Discovery in Sleep, Glycemia, and Mood · Computational Biology and Bioinformatics · Other · Advanced
Circadian Splicing in Human Tissues · Computational Biology and Bioinformatics · Genomics · Advanced
Comparative Genomics of Longevity Gene Loss · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
CRISPR Base Editing Outcome Prediction · Computational Biology and Bioinformatics · Genomics · Advanced
CRISPR Off-Target Binding With Gillespie Models · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Decoding Imagined vs. Heard Speech in fMRI · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Dengue Outbreak Forecasting With Public Data · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Detecting Human Positive Selection in Genome Data · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
DiffDock Drug Docking for Antimalarial Targets · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Dopamine Reward Signals in Reinforcement Learning · Computational Biology and Bioinformatics · Computational Neuroscience · Intermediate
Early Disease Signal Detection With NLP · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Fairness in Histopathology AI Models · Computational Biology and Bioinformatics · Other · Advanced
Federated Sepsis Prediction With Privacy Analysis · Computational Biology and Bioinformatics · Other · Advanced
Finding Micropeptides in lncRNAs · Computational Biology and Bioinformatics · Genomics · Advanced
Flagging Polypharmacy Risks in Adverse Event Reports · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Generative PROTAC Linkers for KRAS-G12D · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Gut Microbiome Fiber Response Modeling · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Heat-Visit Prediction With Satellite Data · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
HIV Resistance Mutation Prediction · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Household Dust Metagenomics and Allergy Links · Computational Biology and Bioinformatics · Genomics · Advanced
Immune Cell Infiltration in Tumor Spheroids · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Language Models for Cytochrome C Evolution · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
LLM Bioinformatics Pipeline Generation · Computational Biology and Bioinformatics · Other · Advanced
Martini MD for Nanoparticle Membrane Wrapping · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Measles County Simulation and Tipping Points · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Menstrual Hormone ODE Modeling for PCOS Sampling · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Migraine Cortical Spreading Depression Model · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Modeling Antibiotic Norms and AMR Spread · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
Modeling SCOBY Biofilm Growth in Kombucha · Computational Biology and Bioinformatics · Computational Biomodeling · Intermediate
Modeling Vaccine Hesitancy on Social Media · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Multi-Omics VAE for Metabolic Stress · Computational Biology and Bioinformatics · Other · Advanced
Nanopore Plasmid Mapping in Wastewater · Computational Biology and Bioinformatics · Genomics · Advanced
Network Pharmacology for Diabetes Herb Formulas · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
PBPK Modeling of Metformin Dosing in Kids vs Adults · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Phage-Host Coevolution Models · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
PhysiCell Fibroblast Migration for Wound Healing · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Plant Circadian Clock Modeling for Climate Shifts · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Plant-Pollinator Coevolution Gene Analysis · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
Pollen Tube Growth Physics Simulation Project · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
Predicting Autism Noncoding Variant Effects · Computational Biology and Bioinformatics · Genomics · Advanced
Predicting Lyme Disease Spread With Climate Data · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Predicting Promoter Strength in Crops · Computational Biology and Bioinformatics · Genomics · Advanced
Privacy-Preserving DNA Kinship Search · Computational Biology and Bioinformatics · Other · Advanced
QSAR Screening for Herb-Drug Interactions · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Rice Pan-Genome Variant Analysis · Computational Biology and Bioinformatics · Genomics · Advanced
RNN Sleep Stage Modeling for Shift Work · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
SARS-CoV-2 Spike Evolution in Animal Reservoirs · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
School Air Quality and Absenteeism Analysis · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
SEIR Modeling of UTI Spread in Water Networks · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Single-Cell RNA-Seq Senescence State Discovery · Computational Biology and Bioinformatics · Genomics · Advanced
Smartphone EEG for Attention Drift Prediction · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Smartphone Mosquito Wingbeat Detection · Computational Biology and Bioinformatics · Other · Advanced
Smartphone Pond Microbe Biodiversity Monitoring · Computational Biology and Bioinformatics · Other · Intermediate
Smartphone Voice Models for Early Neuro Markers · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
SMILES and Bioassay Search Models · Computational Biology and Bioinformatics · Other · Advanced
Spotted Lanternfly Risk Modeling for 2050 · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Advanced
Tinnitus Spiking Network Model · Computational Biology and Bioinformatics · Computational Neuroscience · Advanced
Urban Animal Genetics and City Heat Islands · Computational Biology and Bioinformatics · Computational Evolutionary Biology · Intermediate
Warfarin Dose Prediction With Explainable ML · Computational Biology and Bioinformatics · Computational Pharmacology · Advanced
Wastewater and Absence Data for ER Surge Prediction · Computational Biology and Bioinformatics · Computational Epidemiology · Advanced
Yeast Metabolism Under Household Stressors · Computational Biology and Bioinformatics · Computational Biomodeling · Intermediate
Zebrafish Stripe Simulation and Pattern Fitting · Computational Biology and Bioinformatics · Computational Biomodeling · Advanced
