Machine Learning for KRAS Synthetic-Lethal Partners

ISEF Category: Biomedical Engineering

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Synthetic Biology  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

KRAS mutations drive many cancers, but the mutant protein is hard to target directly. That is why researchers look for synthetic-lethal partners, genes that cancer cells come to depend on once KRAS is mutated. Your project can use machine learning and public genomics data to hunt for those partners. That means you can work on a real cancer problem before ever stepping into a wet lab.

What Is It?

This project asks a simple question with a hard answer: which genes become weak spots when KRAS mutates? Synthetic lethality means two genes or pathways are linked so that losing either one alone is tolerated, but losing both kills the cell. In a KRAS-mutant tumor, the mutation is the first hit, so blocking the partner gene should hurt cancer cells far more than normal ones. Think of it like a bridge that stays up with one support beam removed, then finding the second beam whose removal makes the whole structure fail.

CRISPRi stands for CRISPR interference. Instead of cutting DNA, it uses a deactivated Cas9 to block transcription and turn genes down. A gRNA library is a set of guide RNAs, short sequences that aim the CRISPR system at many genes at once. In this kind of project, an ML model learns patterns from DepMap CRISPR-screen data, then you compare its predictions with TCGA mutation and expression data to see which candidate partner genes look strongest in KRAS-mutant cancers.
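One simple way to turn this idea into numbers is a differential dependency score: for each gene, compare the average screen score in KRAS-mutant cell lines with the average in KRAS wild-type lines. The sketch below is only an illustration. The cell line names, gene names, and values are made up, and it assumes the common convention that a more negative gene-effect score means a stronger dependency.

```python
from statistics import mean

# Toy gene-effect scores (more negative = the cell line depends more on the gene).
# All names and values are invented for illustration, not real DepMap data.
gene_effect = {
    "LINE_1": {"GENE_A": -1.2, "GENE_B": -0.1, "GENE_C": -0.9},
    "LINE_2": {"GENE_A": -1.0, "GENE_B": 0.0, "GENE_C": -1.1},
    "LINE_3": {"GENE_A": -0.1, "GENE_B": -0.2, "GENE_C": -1.0},
    "LINE_4": {"GENE_A": 0.0, "GENE_B": -0.1, "GENE_C": -0.8},
}
# Hypothetical KRAS status for each toy cell line.
kras_mutant = {"LINE_1": True, "LINE_2": True, "LINE_3": False, "LINE_4": False}


def differential_dependency(gene_effect, kras_mutant):
    """Return (gene, score) pairs sorted so the most mutant-selective gene is first.

    score = mean effect in KRAS-mutant lines minus mean effect in wild-type lines.
    """
    genes = sorted(next(iter(gene_effect.values())))
    scores = {}
    for gene in genes:
        mut = [gene_effect[line][gene] for line in gene_effect if kras_mutant[line]]
        wt = [gene_effect[line][gene] for line in gene_effect if not kras_mutant[line]]
        scores[gene] = mean(mut) - mean(wt)
    return sorted(scores.items(), key=lambda item: item[1])


ranking = differential_dependency(gene_effect, kras_mutant)
for gene, score in ranking:
    print(f"{gene}: {score:+.2f}")
```

In this toy table, GENE_C is strongly needed by every line, so its difference is small. That is the pattern you would expect from a gene that is essential everywhere, which is not the same thing as a KRAS-specific partner.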

Why This Is a Good Topic

This is a strong science fair topic because you can test a real biological idea with public data, clear metrics, and a defined output. You do not need to invent a new machine learning method from scratch, but you do need to make design choices about features, validation, and ranking. The topic connects to cancer biology, target discovery, and precision medicine, so your final project has real-world weight. You can learn data cleaning, model evaluation, and how to judge whether a predicted gene partner makes biological sense.

Research Questions

  • How does the choice of ML model change the accuracy of predicting synthetic-lethal partners for KRAS-mutant cells?
  • What is the effect of adding TCGA mutation and expression features on the ranking of candidate gene partners?
  • Does a CRISPRi-based feature set outperform a knockout-based feature set for identifying KRAS synthetic lethality?
  • To what extent do predictions stay consistent across different KRAS mutation subtypes?
  • Which gene classes appear most often among the top-ranked synthetic-lethal partners for common KRAS mutations?
  • How does class imbalance in DepMap training data affect false positive predictions for KRAS partner genes?
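Several of these questions depend on how you measure "accuracy." When true partners are rare, plain accuracy can look excellent for a model that finds nothing, so a ranking metric such as precision at k is often more informative. The following is a minimal sketch with made-up labels and scores; the helper names `accuracy` and `precision_at_k` are illustrative, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:k]) / k


# Toy setup: 100 candidate genes, only 5 are true partners (label 1).
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the majority class.
always_negative = [0] * 100

# Made-up scores from a hypothetical ranking model:
# three true partners score high, two are missed, two negatives score fairly high.
scores = [0.9, 0.8, 0.7, 0.1, 0.05] + [0.6, 0.5] + [0.2] * 93

acc = accuracy(y_true, always_negative)
p_at_5 = precision_at_k(y_true, scores, k=5)
print(f"accuracy of always-negative model: {acc:.2f}")
print(f"precision@5 of ranking model: {p_at_5:.2f}")
```

The always-negative model scores high on accuracy while recovering no partners at all, which is why reporting a ranking metric alongside accuracy makes the comparison between models more honest.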

Basic Materials

  • Laptop or desktop computer with at least 16 GB RAM.
  • Python installed with Jupyter Notebook.
  • Internet access for downloading public datasets.
  • Spreadsheet software for quick inspection of gene lists.
  • PubMed access for reading review papers.
  • TCGA and DepMap public data files.
  • Basic statistics reference sheet or notes.

Advanced Materials

  • Workstation or university cluster access for model training.
  • Python environment with scikit-learn, pandas, NumPy, and SciPy.
  • R with Bioconductor packages for expression analysis.
  • Access to CRISPR screen data: DepMap downloads (mostly CRISPR knockout screens) plus any published CRISPRi screen datasets you plan to use.
  • TCGA mutation and expression matrices.
  • Gene set enrichment analysis tools.
  • Version control with Git and GitHub or GitLab.

Software & Tools

  • Python: Runs data cleaning, feature engineering, model training, and prediction ranking.
  • Jupyter Notebook: Lets you explore DepMap and TCGA data step by step.
  • scikit-learn: Builds baseline classifiers and regression models for candidate ranking.
  • pandas: Organizes gene-level tables, labels, and merged data sets.
  • RStudio: Supports differential expression checks and downstream statistical analysis.

Experiment Steps

  1. Define the exact KRAS mutation set and the cancer types you will study.
  2. Select the public datasets that will supply training labels, features, and validation targets.
  3. Choose a baseline model first, so you can compare every later change against it.
  4. Design your feature set and decide which gene or sample signals belong in the model.
  5. Plan a validation strategy that checks predictions against independent TCGA patterns and held-out data.
  6. Rank the candidate synthetic-lethal partners and decide how you will score biological plausibility.
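Steps 3 and 5 can be sketched together: fit a trivial baseline and one real model, then score both with the same cross-validation splits. This is an illustrative outline only. It uses synthetic data from scikit-learn's `make_classification` as a stand-in for a real gene-level feature table, and the choice of logistic regression and average precision is an assumption, not a recommendation from this guide.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a feature table: roughly 10% positive labels.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)

# The same stratified splits are reused for every model so scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # Baseline: ignores the features and predicts from class frequencies only.
    "baseline": DummyClassifier(strategy="prior"),
    # One simple real model; class_weight="balanced" up-weights the rare class.
    "logistic": LogisticRegression(class_weight="balanced", max_iter=1000),
}

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    results[name] = fold_scores.mean()
    print(f"{name}: mean average precision = {results[name]:.3f}")
```

Any later change to features or model family can be added to the `models` dictionary and compared against the same baseline number.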

Common Pitfalls

  • Training on all KRAS-related samples at once without first setting aside a test set, which leaks information from the test set into the model.
  • Mixing gene symbols and transcript IDs from different datasets, which creates bad merges and missing labels.
  • Treating correlation in TCGA as proof of synthetic lethality, which overstates what the data can support.
  • Ignoring class imbalance, which makes the model predict the majority class and miss rare lethal partners.
  • Picking too many features without filtering, which adds noise and hides the signal from KRAS-specific interactions.
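The identifier pitfall is easy to check for in code. The sketch below assumes one table labels genes as "SYMBOL (ID)" and the other uses bare symbols. It normalizes the labels, merges with pandas, and uses the `indicator` option to list genes that failed to match. The column names and numeric values are invented for illustration, so check the actual headers in the files you download.

```python
import pandas as pd

# Toy dependency table with labels in an assumed "SYMBOL (ID)" format.
depmap = pd.DataFrame(
    {
        "label": ["KRAS (3845)", "TP53 (7157)", "EGFR (1956)"],
        "mean_effect": [-0.8, -0.3, -0.2],  # made-up values
    }
)
# Pull the bare symbol out of the label so both tables share one key.
depmap["symbol"] = depmap["label"].str.extract(r"^(\S+) \(\d+\)$", expand=False)

# Toy mutation table keyed by bare gene symbol.
tcga = pd.DataFrame(
    {
        "symbol": ["KRAS", "TP53", "BRAF"],
        "mutation_count": [120, 300, 40],  # made-up values
    }
)

# An outer merge keeps every gene; the _merge column records where each row came from.
merged = depmap.merge(tcga, on="symbol", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

print(merged[["symbol", "_merge"]])
print("Genes that did not match:", sorted(unmatched["symbol"]))
```

Reviewing the unmatched list before training is a quick way to catch silent losses from mismatched naming schemes.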

What Makes This Competitive

A stronger project would compare several model families, not just one, and report how stable the top gene hits stay across splits. You could add an independent validation layer, such as pathway enrichment, expression stratification, or survival-linked patterns in TCGA. You would also earn more credibility by testing whether the model finds known KRAS partners before claiming new ones. Clear error analysis matters just as much as the ranking itself.
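Two of these checks are simple to compute once you have a ranked gene list from each data split: how much the top hits overlap across splits, and whether known partners show up near the top. The following is a rough sketch with placeholder gene names; the Jaccard overlap is one reasonable stability measure among several, and the "known controls" set here is hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(rankings, k):
    """Average top-k overlap across every pair of ranked lists."""
    pairs = list(combinations(rankings, 2))
    return sum(jaccard(set(a[:k]), set(b[:k])) for a, b in pairs) / len(pairs)


def recovered_controls(ranking, controls, k):
    """Known positive-control genes that appear in the top k of one ranking."""
    return sorted(set(ranking[:k]) & set(controls))


# Placeholder rankings from three hypothetical train/test splits.
rankings = [
    ["G1", "G2", "G3", "G4", "G5"],
    ["G2", "G1", "G4", "G3", "G5"],
    ["G1", "G5", "G2", "G3", "G4"],
]
# Hypothetical set of previously reported partners used as positive controls.
known_controls = {"G2", "G9"}

stability = mean_pairwise_jaccard(rankings, k=3)
hits = recovered_controls(rankings[0], known_controls, k=3)
print(f"mean top-3 Jaccard overlap: {stability:.2f}")
print("controls recovered in split 1:", hits)
```

A low overlap suggests the top hits depend heavily on which samples landed in the training set, which is worth reporting as part of the error analysis.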

Project Variations

  • Test whether the same pipeline predicts synthetic-lethal partners for other oncogenes, such as BRAF or EGFR.
  • Swap the ML model for a pathway-based scoring method and compare which approach ranks known KRAS partners higher.
  • Focus on one cancer type only, such as lung adenocarcinoma, to see whether tissue context changes the predicted gene list.

Learn More

  • DepMap: Search the Broad Institute DepMap portal for CRISPR screening data and gene dependency summaries.
  • TCGA via the NCI GDC: Search the National Cancer Institute Genomic Data Commons for mutation and expression data.
  • PubMed: Search for review articles on KRAS synthetic lethality, CRISPRi screening, and cancer dependency maps.
  • NIH NCBI Bookshelf: Find free background chapters on cancer genetics, CRISPR, and transcriptomics.
  • MIT OpenCourseWare: Search for free computational biology, machine learning, and genomics course materials.
