KRAS-G12D Inhibitor Prediction with ChemProp
ISEF Category: Translational Medical Science
Subcategory: Drug Identification and Testing · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
Cancer cells often act like cars with a stuck accelerator. KRAS-G12D is one of those stuck signals, and many labs are racing to block it. You can join that search with data, code, and chemistry instead of a wet lab. This project lets you predict which molecules might stop the signal before anyone tests them in cells.
What Is It?
This project uses machine learning to predict which chemical structures might block KRAS-G12D, a cancer-related protein target. Think of it like teaching a model to read the shape of a key and predict whether it might fit a lock. ChemProp helps by turning each molecule into a graph, where atoms are nodes and bonds are edges.
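To make the graph idea concrete, here is a minimal sketch in plain Python. It is not how ChemProp stores molecules internally. The molecule (ethanol, SMILES "CCO") is written out by hand as an illustration, and the names `atoms`, `bonds`, `build_adjacency`, and `neighbor_elements` are hypothetical. In a real pipeline, a toolkit such as RDKit would build this structure from a SMILES string.

```python
from collections import defaultdict

# Ethanol ("CCO") written out by hand; hydrogens are left implicit.
# Each atom is a node, labeled here with its element symbol.
atoms = {0: "C", 1: "C", 2: "O"}

# Each bond is an edge: (atom index, atom index, bond order).
bonds = [(0, 1, 1.0), (1, 2, 1.0)]


def build_adjacency(bond_list):
    """Return {atom_index: [neighbor indices]} for an undirected graph."""
    adjacency = defaultdict(list)
    for a, b, _order in bond_list:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return dict(adjacency)


adjacency = build_adjacency(bonds)


def neighbor_elements(atom_index):
    """Element symbols of the atoms bonded to one atom, sorted by symbol."""
    return sorted(atoms[n] for n in adjacency[atom_index])


for index, element in atoms.items():
    print(index, element, "is bonded to", neighbor_elements(index))
```

A graph neural network passes information along these edges, so each atom's learned representation depends on its neighbors and not only on its own element.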
You train the model on known bioactivity data from ChEMBL, then ask it to score new molecules that have not been tested yet. After that, you filter the top hits with ADMET tools. ADMET stands for absorption, distribution, metabolism, excretion, and toxicity. Those filters help you avoid molecules that look promising on paper but fail basic drug-like checks.
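The filtering step can be sketched with a simple drug-likeness rule. The example below uses Lipinski's rule of five as a stand-in for the fuller ADMET checks, which is an assumption made for illustration; SwissADME and pkCSM report many more properties. The candidate names, scores, and descriptor values are made up, and the function names are hypothetical.

```python
# Hypothetical candidates with precomputed descriptors (all values are made up).
candidates = [
    {"name": "cand_A", "score": 0.91, "mol_wt": 420.5, "logp": 3.2, "hbd": 2, "hba": 6},
    {"name": "cand_B", "score": 0.88, "mol_wt": 712.9, "logp": 6.4, "hbd": 4, "hba": 9},
    {"name": "cand_C", "score": 0.75, "mol_wt": 505.0, "logp": 4.1, "hbd": 1, "hba": 7},
]


def lipinski_violations(candidate):
    """Count how many of the four rule-of-five limits a candidate exceeds."""
    checks = [
        candidate["mol_wt"] > 500,  # molecular weight above 500 Da
        candidate["logp"] > 5,      # calculated logP above 5
        candidate["hbd"] > 5,       # more than 5 hydrogen-bond donors
        candidate["hba"] > 10,      # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def filter_and_rank(cands, max_violations=1):
    """Keep candidates within the violation limit, best model score first."""
    kept = [c for c in cands if lipinski_violations(c) <= max_violations]
    return sorted(kept, key=lambda c: c["score"], reverse=True)


for c in filter_and_rank(candidates):
    print(c["name"], c["score"], "violations:", lipinski_violations(c))
```

Allowing one violation is a common convention, but the limit is a design choice you should state and justify in your write-up.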
SHAP-style attribution adds another layer. It helps you see which atoms or substructures pushed the prediction up or down. That gives you a chance to explain the model, not just use it.
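The idea behind this kind of attribution can be shown with exact Shapley values on a toy scoring function. This is a sketch under stated assumptions: `toy_model` is a hypothetical stand-in for a trained model, the substructure names and weights are invented, and the brute-force loop over every ordering only works for a handful of features. The SHAP library uses faster approximations for real models.

```python
from itertools import permutations


def toy_model(present):
    """Hypothetical score for a molecule described by a set of substructures."""
    score = 5.0
    if "amine" in present:
        score += 1.0
    if "halogen" in present:
        score += 0.5
    if "amine" in present and "halogen" in present:
        score += 0.4  # interaction: the pair is worth more than the parts
    if "nitro" in present:
        score -= 0.8
    return score


def shapley_values(model, features):
    """Average each feature's marginal contribution over all orderings."""
    features = list(features)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        previous = model(present)
        for f in order:
            present.add(f)
            current = model(present)
            totals[f] += current - previous  # marginal contribution of f
            previous = current
    return {f: totals[f] / len(orderings) for f in features}


features = ["amine", "halogen", "nitro"]
values = shapley_values(toy_model, features)
for name, value in values.items():
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the difference between the full prediction and the empty baseline, and the interaction bonus is split evenly between the two features that create it.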
Why This Is a Good Topic
This makes a strong science fair topic because it has a clear input, a clear output, and real-world stakes. You can test the effect of different training sets, fingerprints, model settings, and filtering rules on prediction quality. You also learn skills that matter in drug discovery, like data cleaning, model validation, and interpretation. A student can complete this with public datasets and free tools, which makes the project realistic if you plan well.
Research Questions
- How does the size of the ChEMBL training set affect ChemProp performance for KRAS-G12D inhibitor prediction?
- What is the effect of adding ADMET filters on the chemical diversity of the top-ranked molecules?
- Does a graph neural network outperform a simpler fingerprint-based model on KRAS-G12D bioactivity prediction?
- To what extent do SHAP-style atom attributions agree with known medicinal chemistry features of active compounds?
- Which molecular substructures are most associated with high predicted KRAS-G12D activity?
- How does the choice of activity threshold change model precision, recall, and ranking of candidate inhibitors?
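The last question above can be explored with a few lines of code. The sketch below assumes activity is expressed as pIC50 and that a compound counts as active when its value meets a chosen cutoff. The measured and predicted values are made up, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(measured, predicted, cutoff):
    """Precision and recall when 'active' means a value at or above cutoff."""
    actual = [m >= cutoff for m in measured]
    called = [p >= cutoff for p in predicted]
    tp = sum(a and c for a, c in zip(actual, called))
    fp = sum((not a) and c for a, c in zip(actual, called))
    fn = sum(a and (not c) for a, c in zip(actual, called))
    # Return None when a metric is undefined (no positives to divide by).
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Made-up pIC50 values for six compounds.
measured = [7.2, 6.5, 5.8, 5.1, 8.0, 6.1]
predicted = [6.9, 6.8, 6.2, 4.9, 7.5, 5.7]

for cutoff in (6.0, 7.0):
    precision, recall = precision_recall(measured, predicted, cutoff)
    print(f"cutoff {cutoff}: precision={precision}, recall={recall}")
```

With this toy data, raising the cutoff from 6.0 to 7.0 raises precision and lowers recall, which is the kind of trade-off the research question asks you to measure on real data.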
Basic Materials
- Computer with enough memory to run Python scripts and train small models.
- Python and a notebook environment such as Jupyter.
- Curated ChEMBL dataset for KRAS or related RAS targets.
- Chemical structure file formats such as SMILES or SDF.
- Web access to the free SwissADME and pkCSM servers.
- Spreadsheet software for tracking compounds, labels, and scores.
- Basic statistics reference for model validation metrics.
Advanced Materials
- Access to a workstation with a GPU for faster model training.
- Python packages: ChemProp, RDKit, scikit-learn, pandas, numpy, matplotlib, and shap.
- Larger ChEMBL or PubChem-derived structure set for external validation.
- Curated benchmark data for KRAS or closely related targets.
- Molecular visualization software such as PyMOL or UCSF ChimeraX.
- Version control system such as Git for tracking code changes.
- Optional access to a Unix-like environment for reproducible runs.
Software & Tools
- ChemProp: Trains graph neural network models on molecular structures and bioactivity labels.
- RDKit: Cleans structures, calculates descriptors, and helps you inspect molecules.
- scikit-learn: Splits data, scores models, and compares baseline methods.
- SHAP: Estimates which atoms or features push predictions higher or lower.
- Python: Runs the full analysis pipeline and automates data handling.
Experiment Steps
- Define the prediction task by choosing the target label, activity cutoff, and compound inclusion rules.
- Curate a clean dataset from ChEMBL, then remove duplicates, salts, and unclear measurements, such as censored values reported with ">" or "<" or records with missing units.
- Split the data in a way that tests generalization, not memorization, and keep a true holdout set.
- Train a ChemProp model and at least one simpler baseline so you can compare performance honestly.
- Rank new candidate molecules, then apply ADMET filters to remove poor drug-like options.
- Interpret the strongest candidates with atom-level attribution and plan a validation strategy for the next stage.
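The curation step above can be sketched as follows. This is a minimal illustration, assuming the records were exported with ChEMBL-style fields (`canonical_smiles`, `standard_relation`, `standard_value`, `standard_units`) and that the structures were already standardized elsewhere. The SMILES strings are placeholders, the values are made up, and `curate` is a hypothetical helper. IC50 values in nM are converted to pIC50 with pIC50 = 9 - log10(IC50 in nM).

```python
import math
from collections import defaultdict
from statistics import median

# Placeholder records; structures and values are made up for illustration.
records = [
    {"canonical_smiles": "CCO", "standard_relation": "=", "standard_value": 100.0, "standard_units": "nM"},
    {"canonical_smiles": "CCO", "standard_relation": "=", "standard_value": 1000.0, "standard_units": "nM"},
    {"canonical_smiles": "CCN", "standard_relation": ">", "standard_value": 10000.0, "standard_units": "nM"},
    {"canonical_smiles": "CCC", "standard_relation": "=", "standard_value": 10.0, "standard_units": "nM"},
    {"canonical_smiles": "CCCl", "standard_relation": "=", "standard_value": 5.0, "standard_units": "ug/mL"},
]


def curate(rows):
    """Keep exact nM measurements, convert to pIC50, merge duplicates by median."""
    by_structure = defaultdict(list)
    for row in rows:
        if row["standard_relation"] != "=":
            continue  # drop censored values such as ">10000"
        if row["standard_units"] != "nM":
            continue  # drop units that cannot be compared directly
        if row["standard_value"] <= 0:
            continue  # log10 is undefined for non-positive values
        pic50 = 9.0 - math.log10(row["standard_value"])
        by_structure[row["canonical_smiles"]].append(pic50)
    return {smiles: median(values) for smiles, values in by_structure.items()}


dataset = curate(records)
for smiles, pic50 in dataset.items():
    print(smiles, round(pic50, 2))
```

Taking the median of repeated measurements is one reasonable choice. Whatever rule you pick, record it, because it changes the labels the model learns from.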
Common Pitfalls
- Mixing assay types or measurement units, which turns the training labels into noise.
- Letting similar molecules appear in both train and test sets, which makes the model look better than it really is.
- Using raw ChEMBL entries without standardizing salts, stereochemistry, or duplicate structures.
- Treating one high prediction score as proof of activity, which ignores uncertainty and false positives.
- Skipping external validation, which hides overfitting to one narrow data slice.
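One way to avoid the train/test leakage described above is to split by group, so that every molecule sharing a scaffold or cluster lands on the same side. The sketch below assumes you have already assigned a group ID to each molecule, for example from a scaffold calculation in a cheminformatics toolkit. The IDs shown are placeholders, and `group_split` is a hypothetical helper rather than the splitter built into ChemProp or scikit-learn.

```python
from collections import defaultdict


def group_split(group_ids, test_fraction=0.2):
    """Assign whole groups to train or test so no group appears in both."""
    groups = defaultdict(list)
    for index, gid in enumerate(group_ids):
        groups[gid].append(index)

    # Largest groups first, so the rarer groups tend to end up in the test set.
    ordered = sorted(groups.values(), key=lambda members: (-len(members), members[0]))

    n_total = len(group_ids)
    train_target = n_total - round(test_fraction * n_total)

    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= train_target:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)


# Placeholder scaffold IDs for ten molecules.
group_ids = ["s1", "s1", "s1", "s2", "s2", "s3", "s4", "s4", "s4", "s4"]
train_idx, test_idx = group_split(group_ids, test_fraction=0.2)

print("train:", train_idx)
print("test:", test_idx)
```

Because groups are kept whole, the test set may not match the requested fraction exactly. Report the actual sizes alongside your metrics.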
What Makes This Competitive
A stronger project does more than train one model. It tests multiple splitting strategies, compares against baselines, and checks whether the model still works on outside data. It also explains why the model prefers certain atoms or motifs, then ties that back to chemistry. A top project would frame the work as a careful screening pipeline, not just a code demo.
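A simple way to compare against a baseline is to ask whether the model beats a predictor that always outputs the training-set mean. The sketch below uses made-up pIC50 values, and `model_pred` stands in for the predictions of whatever model you trained. Root-mean-square error (RMSE) is one reasonable metric for a regression task; it is chosen here for illustration only.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length lists."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up pIC50 values.
train_y = [6.0, 7.0, 5.0, 6.0]
test_y = [7.0, 5.0, 6.0]

# Stand-in for held-out predictions from a trained model.
model_pred = [6.5, 5.5, 6.0]

# Baseline: always predict the mean of the training labels.
baseline_pred = [mean(train_y)] * len(test_y)

model_rmse = rmse(test_y, model_pred)
baseline_rmse = rmse(test_y, baseline_pred)

print(f"model RMSE:    {model_rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

If a model cannot beat this baseline on held-out data, its rankings should not be trusted, no matter how sophisticated the architecture is.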
Project Variations
- Use KRAS wild-type and KRAS-G12D side by side to compare whether the model learns mutation-specific patterns.
- Swap ChemProp for a fingerprint model or a transformer-based model to compare prediction quality and interpretability.
- Focus on repurposed FDA-approved molecules and ask which ones score well after ADMET filtering.
Learn More
- ChEMBL: Search the ChEMBL database for bioactivity records and target-specific compound data.
- PubMed: Search for review articles on KRAS-G12D inhibitors, ChemProp, and ADMET modeling.
- NIH PubChem: Look up compound structures, synonyms, and linked bioassay data.
- MIT OpenCourseWare: Find free materials on machine learning, cheminformatics, and statistical learning.
- DrugBank: Review drug properties and target information for context, using the free public summaries.
- Journal of Chemical Information and Modeling: Read peer-reviewed papers on molecular machine learning and explainable prediction.