QSAR Screening for Herb-Drug Interactions

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Pharmacology · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A missed herb-drug interaction can matter as much as a missed drug side effect. St. John's wort, grapefruit compounds, and ashwagandha can all change how your body handles medicines. Your project can train a model to catch risky compounds before they slip through. That gives you a real safety problem, real public data, and a clean way to test machine learning.

What Is It?

This project asks a simple question with real health stakes: can a low-cost machine learning model help spot herb compounds that inhibit CYP450 enzymes? CYP450 is a family of enzymes, found mostly in the liver, that break down many medicines. If an herb slows one of those enzymes, a drug can stay in the body longer than expected. If an herb speeds one up, a drug can clear too fast. Either way, the interaction can change the effective dose a person gets.

QSAR stands for quantitative structure-activity relationship. The name is long, but the idea is simple: you give the model a compound's chemical structure, encoded as numbers such as a molecular fingerprint, and ask it to predict a property, such as whether the compound inhibits a CYP enzyme. Think of it as teaching a computer to connect shape and chemical features to behavior. The active-learning loop then picks the compounds the model is least sure about from a pool it has not trained on, adds them and their known labels to the training set, and retrains the model. Because labeling effort goes to the most informative compounds, the model may learn faster and miss fewer dangerous compounds.
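
To make the structure-to-activity idea and the uncertainty query concrete, here is a minimal sketch in plain Python. It assumes each compound has already been turned into a fingerprint, stored as a set of "on" bit positions. The bit sets, compound names, labels, and function names are all made up for illustration, and the nearest-neighbor vote is a stand-in for whatever classifier the project ends up using. In a real workflow RDKit would generate the fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def inhibitor_probability(query_fp, labeled, k=3):
    """Share of inhibitors among the k labeled compounds most similar to the query."""
    nearest = sorted(
        labeled, key=lambda item: tanimoto(query_fp, item[0]), reverse=True
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def pick_most_uncertain(pool, labeled, batch_size=1):
    """Names of pool compounds whose predicted probability sits closest to 0.5."""
    ranked = sorted(
        pool,
        key=lambda name: abs(inhibitor_probability(pool[name], labeled) - 0.5),
    )
    return ranked[:batch_size]


# Made-up training data: (set of on-bits, label), 1 = inhibitor, 0 = non-inhibitor.
labeled = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 9}, 1),
    ({2, 3, 4, 5}, 1),
    ({10, 11, 12}, 0),
    ({10, 11, 13}, 0),
    ({11, 12, 14}, 0),
]

# Unlabeled pool, plus a hypothetical lookup table standing in for held-back assay labels.
pool = {
    "herb_x": {1, 2, 3, 5},
    "herb_y": {3, 4, 10, 11},
    "herb_z": {10, 11, 14},
}
held_back_labels = {"herb_x": 1, "herb_y": 1, "herb_z": 0}

# One active-learning round: query, look up the label, move the compound to the training set.
for name in pick_most_uncertain(pool, labeled, batch_size=1):
    labeled.append((pool.pop(name), held_back_labels[name]))
    print(f"Queried {name}; training set now has {len(labeled)} compounds")
```

In this toy data, `herb_y` shares bits with both classes, so it is the compound the sketch queries first. A nearest-neighbor vote has no separate fitting step, so "retraining" here just means the next prediction uses the enlarged training set. With a fitted model you would refit after each round.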

Why This Is a Good Topic

This is a strong science fair topic because you can test a clear prediction, measure model performance, and improve the model in a planned way. You can connect it to medication safety, which makes the work feel real, not abstract. You can also do meaningful original research without a wet lab, since public chemistry and bioactivity databases already hold useful data. A student can learn feature engineering, model evaluation, class imbalance, and active learning, all of which matter in real computational biology.

Research Questions

How does active learning change the false negative rate for CYP450 inhibition prediction compared with a one-shot QSAR model?
What is the effect of using different molecular fingerprints on recall for herb compounds with sparse training data?
Does adding public herb-specific compounds improve prediction for St. John's wort, grapefruit furanocoumarins, and ashwagandha?
To what extent does class imbalance correction improve sensitivity without hurting precision?
Which active-learning query strategy finds the most uncertain herb compounds for retraining?
How does model performance change when you predict one CYP isozyme at a time versus a combined inhibition label?
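
The last question depends on how a combined inhibition label is built from per-isozyme results. The sketch below shows one possible rule, not a standard one: a compound counts as a combined inhibitor if it inhibits any tested isozyme, as a non-inhibitor only if every isozyme in the panel was tested and negative, and as unknown otherwise. The five-isozyme panel, the compound names, and the label values are assumptions made for illustration.

```python
# Assumed panel of isozymes for this illustration.
ISOZYMES = ["CYP1A2", "CYP2C9", "CYP2C19", "CYP2D6", "CYP3A4"]

# Hypothetical per-isozyme labels: 1 = inhibitor, 0 = non-inhibitor.
# A missing key means the compound was not tested against that isozyme.
records = {
    "compound_A": {"CYP3A4": 1, "CYP2D6": 0},
    "compound_B": {"CYP1A2": 0, "CYP2C9": 0, "CYP2C19": 0, "CYP2D6": 0, "CYP3A4": 0},
    "compound_C": {"CYP2C9": 0},
}


def combined_label(per_isozyme):
    """Collapse per-isozyme labels into one label, keeping 'unknown' explicit."""
    values = [per_isozyme.get(isozyme) for isozyme in ISOZYMES]
    if any(value == 1 for value in values):
        return 1  # inhibits at least one isozyme
    if all(value == 0 for value in values):
        return 0  # tested against the whole panel and negative everywhere
    return None  # some isozymes untested, so a negative call is not justified


for name, per_isozyme in records.items():
    print(name, combined_label(per_isozyme))
```

Treating untested isozymes as unknown rather than negative keeps missing data from being counted as evidence of safety, which matters when the goal is to lower false negatives.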

Basic Materials

Laptop with enough memory to run Python notebooks.
Python 3 with pandas, scikit-learn, RDKit, and matplotlib.
Spreadsheet software for tracking compounds and labels.
PubChem access for structure files and compound IDs.
PubMed or public database records for checking assay context.
Optional cloud notebook account if your laptop is slow.

Advanced Materials

Workstation or cloud environment with Python, RDKit, scikit-learn, XGBoost, and SHAP.
Curated CYP450 inhibition datasets from ChEMBL, PubChem BioAssay, or literature tables.
Additional herb constituent lists from natural product databases and review articles.
Jupyter Notebook for reproducible analysis.
Version control with Git for tracking model changes.
High-quality plotting package such as seaborn or plotly.

Software & Tools

Python: Runs the modeling workflow, data cleaning, and evaluation scripts.
RDKit: Converts chemical structures into fingerprints and molecular descriptors.
scikit-learn: Trains baseline classifiers and scores recall, precision, and ROC AUC.
Jupyter Notebook: Keeps code, notes, and plots in one reproducible file.
PubChem: Provides compound structures and identifiers for herb constituents and control molecules.

Experiment Steps

Define the exact prediction task, such as one combined binary CYP inhibition label or a separate label for one isozyme at a time.
Collect public labeled compounds and clean the structures so the same molecule appears only once.
Choose a baseline model and one or two molecular representations to compare.
Design an active-learning loop that selects the most uncertain compounds, adds them with their labels to the training set, and retrains the model.
Plan evaluation metrics that punish false negatives, such as recall, F1, and balanced accuracy.
Prepare a comparison set focused on herb-derived compounds and document where the model fails.
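
The cleaning step above can be sketched as follows. This is a minimal illustration, assuming every record already carries a standardized structure key (in practice this might be an InChIKey or canonical SMILES produced by RDKit after salt stripping). The keys, source names, labels, and function names are placeholders, not real data.

```python
import random
from collections import defaultdict

# Hypothetical rows: (structure_key, source, label), 1 = inhibitor, 0 = non-inhibitor.
rows = [
    ("KEY-001", "source_a", 1),
    ("KEY-001", "source_b", 1),  # same molecule reported twice, labels agree
    ("KEY-002", "source_a", 0),
    ("KEY-003", "source_a", 1),
    ("KEY-003", "source_b", 0),  # same molecule, conflicting labels
    ("KEY-004", "source_b", 0),
    ("KEY-005", "source_a", 0),
    ("KEY-006", "source_b", 1),
]


def deduplicate(rows):
    """Keep one label per structure key and set aside keys with conflicting labels."""
    labels_by_key = defaultdict(set)
    for key, _source, label in rows:
        labels_by_key[key].add(label)
    conflicts = sorted(key for key, labels in labels_by_key.items() if len(labels) > 1)
    clean = {
        key: next(iter(labels))
        for key, labels in labels_by_key.items()
        if len(labels) == 1
    }
    return clean, conflicts


def split_by_key(clean, test_fraction=0.25, seed=0):
    """Split on structure keys so no molecule lands in both train and test."""
    keys = sorted(clean)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = {key: clean[key] for key in keys if key not in test_keys}
    test = {key: clean[key] for key in test_keys}
    return train, test


clean, conflicts = deduplicate(rows)
train, test = split_by_key(clean)
print(f"{len(clean)} unique compounds, {len(conflicts)} set aside for review")
print(f"train: {len(train)}, test: {len(test)}")
```

Dropping conflicting compounds is only one option. You could also review them by hand and keep the label from the assay that best matches your task definition. The split here is random and not stratified, so with a rare inhibitor class you may want to check that both sets contain positives.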

Common Pitfalls

Mixing assay types from different sources, which can blur the label because one dataset may test inhibition while another tests binding.
Letting duplicate compounds leak into both training and test sets, which makes the model look better than it really is.
Using only accuracy, which hides false negatives in a rare-event prediction task.
Skipping structure cleaning for salts, mixtures, or stereoisomers, which can create mismatched fingerprints.
Training on broad druglike compounds only, which can make the model weak on plant-derived molecules.
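
To see why accuracy alone can mislead, consider a made-up screen of 100 compounds in which 10 are true inhibitors and the model flags only 2 of them. The sketch below computes the metrics from a confusion matrix by hand. The data are invented and the function name is hypothetical. scikit-learn provides equivalent scoring functions.

```python
def screening_metrics(y_true, y_pred):
    """Confusion-matrix metrics for a binary task where 1 = inhibitor."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Invented example: 10 inhibitors, 90 non-inhibitors, only 2 inhibitors caught.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 2 + [0] * 98

metrics = screening_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Here accuracy comes out at 0.92 even though the model misses 8 of 10 inhibitors. Recall (0.20) and balanced accuracy (0.60) expose the problem.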

What Makes This Competitive

A competitive version goes beyond building a normal classifier. You compare several molecular representations, test class-imbalance methods, and show whether active learning really lowers false negatives on herb compounds. You can also split by CYP isozyme and test whether the model generalizes across supplement classes. Strong submissions explain the error patterns, not just the final score.
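
One simple class-imbalance method to include in that comparison is class weighting. The sketch below computes weights with the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The label counts are invented and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented training labels: 10 inhibitors (1) and 90 non-inhibitors (0).
labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(labels)
print(weights)
```

With these counts, each inhibitor gets a weight of 5.0 and each non-inhibitor a weight of about 0.56, so a missed inhibitor costs the model roughly nine times as much during training. Whether that trade improves recall without hurting precision too much is something to measure, not assume.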

Project Variations

Focus only on grapefruit furanocoumarins and compare CYP3A4 prediction methods.
Swap in a multi-label model that predicts several CYP enzymes at once.
Add explainability analysis with SHAP or feature importance to find which chemical motifs drive false negatives.

Learn More

PubChem: Search compound records and bioassays for herb constituents and CYP-related assay data on the NIH database portal.
ChEMBL: Search assay and activity records for enzyme inhibition data and download structured tables for modeling.
NIH PubMed: Search review articles on herb-drug interactions, CYP450 inhibition, and QSAR methods.
RDKit Book: Read the free online documentation for molecular fingerprints, descriptors, and cheminformatics workflows.
MIT OpenCourseWare, 6.86 Machine Learning: Use the open course materials to review supervised learning, evaluation, and class imbalance ideas.

Computational Biology and Bioinformatics Category Guide

How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

