CRISPR Base Editing Outcome Prediction

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Genomics · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months

The Hook

A base editor can fix a mutation by changing a single DNA letter, but it can also cause extra edits nearby. That makes base editing powerful and tricky. If you can predict those extra changes, you can help make gene editing safer and cleaner. That is a strong science fair problem with real stakes.

What Is It?

CRISPR base editors are tools that change one DNA letter into another without cutting both DNA strands. Think of them like a tiny spell checker that swaps one character in a sentence. The catch is that the editor sometimes changes nearby letters too. Those extra changes are called bystander edits.

Your project asks whether a computer model can predict those outcomes from the DNA sequence around the target site. The sequence context means the letters before, after, and near the target base. A machine learning model can look for patterns humans may miss, like whether certain neighboring bases make bystander edits more likely. Attention-based interpretation helps you see which parts of the sequence the model uses most when it makes a prediction.
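To make "sequence context" concrete, here is a minimal sketch of one common way to turn the letters around a target base into numbers a model can read. The function names, the flank size, and the example sequence are illustrative assumptions, not part of any specific dataset or published pipeline.

```python
BASES = "ACGT"

def extract_window(sequence, target_index, flank):
    """Return the target base plus `flank` bases on each side, padding with N at the ends."""
    padded = "N" * flank + sequence.upper() + "N" * flank
    center = target_index + flank  # position of the target base in the padded string
    return padded[center - flank : center + flank + 1]

def one_hot(window):
    """Encode each base as a 4-element row in A, C, G, T order; N becomes all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in window]

# Made-up example sequence; index 5 is the hypothetical target base (an A).
example = "GACCTAGCATCGATTGCA"
window = extract_window(example, target_index=5, flank=3)
encoded = one_hot(window)

print(window)        # the 7-letter context centered on the target
print(len(encoded))  # one row per base in the window
```

Changing `flank` is the simplest way to explore the window-size question: a larger flank gives the model more context but also more inputs to fit.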

Why This Is a Good Topic

This topic works well for a science fair because you can test a clear prediction problem with public data, no wet lab required. You get a real biomedical question, a real-world connection to safer gene editing, and a project that uses data science, model evaluation, and interpretation. You can start simple, then make the project stronger by comparing models, testing different sequence windows, or checking whether your model generalizes to new datasets.

Research Questions

How does sequence context around a base editor target affect the accuracy of bystander edit prediction?
What is the effect of changing the input window size on model performance for base editing outcome prediction?
Does an attention-based model identify sequence positions linked to bystander edits better than a simpler baseline model?
To what extent do different base editor datasets produce the same sequence-context signals for bystander editing?
Which neighboring nucleotides most strongly predict extra edits near the target base?
How does model performance change when you train on one public dataset and test on another?

Basic Materials

Laptop or desktop computer with enough memory to run Python notebooks.
Stable internet access for downloading public datasets.
Python installed through Anaconda or Miniconda.
Jupyter Notebook or Google Colab for model building.
Public DeepBE and BE-Hive datasets from published sources.
Spreadsheet software for quick data checks.
Headphones or a quiet workspace for long training runs.

Advanced Materials

University or lab workstation with a GPU.
Python environment with PyTorch or TensorFlow.
Data storage for multiple public genomics datasets.
Genome annotation tools for checking sequence features.
Version control with Git for tracking model changes.
GPU monitoring tools for training diagnostics.
Statistical software for comparing model outputs across datasets.

Software & Tools

Python: Runs data cleaning, feature building, model training, and evaluation notebooks.
Jupyter Notebook: Lets you document analysis, code, and figures in one place.
Google Colab: Provides free cloud compute for smaller model training runs.
PyTorch: Builds neural network models for sequence prediction tasks.
scikit-learn: Provides baseline models, evaluation metrics, and train-test splitting utilities.
pandas: Organizes sequence and outcome data into tables for analysis.
Matplotlib: Makes performance plots, calibration charts, and attention summaries.

Experiment Steps

Define the prediction target, such as bystander edit count, edit pattern class, or outcome probability.
Choose one public dataset as your main training set and decide how you will split it without leaking similar sequences across splits.
Build a simple baseline model first, so you can measure whether the attention model adds value.
Design the sequence input format and compare several window sizes on held-out data to see which one predicts best.
Train the attention-based model and plan how you will translate attention weights into sequence-context insight.
Set up an external validation test on a separate dataset or editor type to see whether your model generalizes.
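The split and baseline steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the grouping rule (guides that share a 12-letter core count as "similar"), the `core_start` and `core_end` values, and the toy records are all hypothetical, and your real dataset may need a different similarity rule. The baseline simply predicts the training-set average, which gives you a floor that any real model should beat.

```python
import hashlib

def group_key(guide, core_start=4, core_end=16):
    """Hypothetical similarity rule: guides with the same core slice share a group."""
    return guide.upper()[core_start:core_end]

def assign_split(key, test_fraction=0.2):
    """Map a group key to 'train' or 'test' deterministically via a hash."""
    digest = hashlib.sha256(key.encode("ascii")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def grouped_split(records, test_fraction=0.2):
    """Split (guide, label) records so that no group appears on both sides."""
    train, test = [], []
    for guide, label in records:
        side = assign_split(group_key(guide), test_fraction)
        (test if side == "test" else train).append((guide, label))
    return train, test

def mean_baseline_mae(train, test):
    """Predict the training mean for every test example; return mean absolute error."""
    if not train or not test:
        return None  # tiny toy splits can leave one side empty
    mean_label = sum(label for _, label in train) / len(train)
    return sum(abs(label - mean_label) for _, label in test) / len(test)

# Made-up toy records: (guide sequence, bystander edit fraction).
records = [
    ("ACGTACGTACGTACGTACGT", 0.10),
    ("TCGTACGTACGTACGTACGA", 0.12),  # same core as the first guide
    ("GGGTTTAAACCCGGGTTTAA", 0.40),
    ("CATGCATGCATGCATGCATG", 0.05),
]
train, test = grouped_split(records, test_fraction=0.5)
print(len(train), len(test), mean_baseline_mae(train, test))
```

Because the split depends only on the group key, the first two guides always land on the same side, which is the property that prevents leakage.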

Common Pitfalls

Splitting nearly identical guide sequences across training and test sets, which makes the model look better than it really is.
Treating attention weights as proof of causation, when they only show where the model focused, not what biologically drives an edit.
Mixing different outcome labels from DeepBE and BE-Hive without checking that they mean the same thing.
Using too many sequence features at once, which can hide the effect of nearby bases.
Reporting only training accuracy, which misses poor performance on unseen sequences.
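One way to cross-check attention-based claims is to change each base in turn and measure how much the prediction moves. The guide itself does not name this technique; it is often called in-silico mutagenesis, and it is offered here only as a hedged sketch. The `toy_predict` function is a made-up stand-in for a trained model, and its "TC" rule is an invented example, not a biological finding.

```python
BASES = "ACGT"

def substitution_effects(predict, sequence):
    """For each position, return the largest prediction change from swapping that base."""
    reference = predict(sequence)
    effects = []
    for i, original in enumerate(sequence):
        changes = []
        for base in BASES:
            if base == original:
                continue  # skip the unchanged sequence
            mutated = sequence[:i] + base + sequence[i + 1:]
            changes.append(abs(predict(mutated) - reference))
        effects.append(max(changes))
    return effects

def toy_predict(sequence):
    """Stand-in model (not trained): the score rises for every 'TC' pair in the sequence."""
    return 0.1 + 0.2 * sequence.count("TC")

effects = substitution_effects(toy_predict, "ATCGA")
for position, effect in enumerate(effects):
    print(position, round(effect, 3))
```

If a position gets high attention but swapping its base barely changes the prediction, that is a sign the attention weight should not be read as a sequence rule.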

What Makes This Competitive

A strong version of this project does more than train one model. You compare against clear baselines, test whether the model works on a separate dataset, and measure performance with metrics that fit the task. You also explain the biology behind the patterns, not just the code. If you connect attention results to specific sequence-context rules and check them across datasets, your project starts to look like real computational genomics research.
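Checking attention results across datasets can be as simple as averaging the weights per position and asking whether the same positions rank highest in both. The sketch below assumes you have already exported one row of attention weights per sequence from your model; the function names and the numbers are made up for illustration.

```python
def mean_attention_profile(attention_rows):
    """Average per-position attention weights over sequences of equal length."""
    n_rows = len(attention_rows)
    n_positions = len(attention_rows[0])
    return [sum(row[p] for row in attention_rows) / n_rows for p in range(n_positions)]

def top_positions(profile, k):
    """Return the set of the k positions with the highest average weight."""
    ranked = sorted(range(len(profile)), key=lambda p: profile[p], reverse=True)
    return set(ranked[:k])

def overlap_fraction(profile_a, profile_b, k):
    """Fraction of the top-k positions that two profiles have in common."""
    return len(top_positions(profile_a, k) & top_positions(profile_b, k)) / k

# Made-up attention weights: each row is one sequence, each column one position.
dataset_a = [[0.1, 0.5, 0.3, 0.1], [0.2, 0.4, 0.3, 0.1]]
dataset_b = [[0.4, 0.1, 0.4, 0.1], [0.5, 0.1, 0.3, 0.1]]

profile_a = mean_attention_profile(dataset_a)
profile_b = mean_attention_profile(dataset_b)
shared = overlap_fraction(profile_a, profile_b, k=2)
print(top_positions(profile_a, 2), top_positions(profile_b, 2), shared)
```

A low overlap does not by itself mean the model is wrong. The two datasets may use different editors or cell types, so treat the number as a prompt for a closer look rather than a verdict.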

Project Variations

Predict only whether a bystander edit happens, instead of the full outcome pattern.
Compare an attention model with a random forest or logistic regression baseline on the same sequence features.
Test whether models trained on one editor family still work on another editor family or on a held-out genomic context.
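For the first variation, you need a yes/no label for each site. Here is a minimal sketch, assuming each site is stored as a dictionary of position to edited fraction. The 0.05 threshold, the position numbering, and the toy sites are illustrative assumptions, so check how your chosen dataset actually reports outcomes. The majority-class baseline shows the accuracy you would get by always guessing the most common label.

```python
from collections import Counter

def has_bystander(edit_fractions, target_position, threshold=0.05):
    """True if any non-target position is edited at or above the threshold."""
    return any(
        fraction >= threshold
        for position, fraction in edit_fractions.items()
        if position != target_position
    )

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

# Made-up sites: ({position: edited fraction}, target position).
sites = [
    ({5: 0.60, 7: 0.20}, 5),  # clear bystander edit at position 7
    ({5: 0.50, 4: 0.01}, 5),  # position 4 is below the threshold
    ({6: 0.40}, 6),           # only the target is edited
    ({5: 0.30, 8: 0.10}, 5),  # bystander edit at position 8
]
labels = [has_bystander(fractions, target) for fractions, target in sites]
print(labels)
print(majority_baseline_accuracy(labels[:3], labels[3:]))
```

If bystander edits are rare in your data, the majority baseline can score deceptively high, which is one reason to report more than accuracy alone.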

Learn More

PubMed: Search for review articles on CRISPR base editing and bystander edits to get the biology background.
NCBI GEO: Find public sequence and genomics datasets that may help you compare with other editing studies.
NIH Bookshelf: Read free background chapters on genome editing and machine learning basics in biology.
MIT OpenCourseWare: Search for free lectures on machine learning, statistics, and computational biology.
arXiv: Search for preprints on base editing prediction models and sequence-to-outcome modeling.
Nature Biotechnology: Search the journal for peer-reviewed base editing studies and model interpretation papers.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →


To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →