Bayesian Drug Screening for Mpro Lead Discovery
ISEF Category: Translational Medical Science
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Drug Identification and Testing · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
Drug screens can search millions of molecules, but most of that effort goes to compounds that turn out to be inactive. A model that learns as it goes can shrink the search space quickly. Your project asks whether such a model can find promising Mpro candidates while testing only a small fraction of the library. That is a real speedup question, not just a coding exercise.
What Is It?
This project combines a molecular property model with active learning. Active learning means the model picks the next molecules to test based on what it still does not know. Think of it as a guessing game where each guess gives you clues for the next one, instead of checking every card in the deck.
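ChemProp is a machine learning tool that reads chemical structures as molecular graphs and predicts a property it has been trained on, here whether a molecule is active against a target. Bayesian optimization adds a search layer on top. It balances two goals: picking molecules that look good now, and picking molecules that teach the model the most for the next round. In this project, you run that loop on the LIT-PCBA benchmark, a public dataset built for testing virtual screening methods against protein targets, with Mpro, the main protease of SARS-CoV-2, as the target of interest. Before you start, confirm that the LIT-PCBA release you download includes Mpro assay data; if it does not, you will need a separate public Mpro activity set or a different benchmark target.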
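The trade-off between molecules that look good now and molecules that teach the model the most can be made concrete with an acquisition score. The following is a minimal sketch, assuming an upper confidence bound (UCB) rule, which is only one common choice and not necessarily the rule this project will settle on. The names `ucb_scores`, `select_batch`, and `beta` are hypothetical, and the toy means and standard deviations stand in for predictions you might get from an ensemble of models.

```python
def ucb_scores(means, stds, beta=1.0):
    """Upper confidence bound: predicted activity plus beta times uncertainty."""
    return [m + beta * s for m, s in zip(means, stds)]


def select_batch(means, stds, batch_size, beta=1.0):
    """Return indices of the batch_size molecules with the highest UCB score."""
    scores = ucb_scores(means, stds, beta)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Toy predictions for five untested molecules (made-up numbers, not real data).
means = [0.80, 0.50, 0.75, 0.20, 0.60]
stds = [0.05, 0.40, 0.10, 0.05, 0.30]

greedy = select_batch(means, stds, batch_size=2, beta=0.0)   # exploit only
explore = select_batch(means, stds, batch_size=2, beta=2.0)  # reward uncertainty
print("greedy picks:", greedy)
print("exploratory picks:", explore)
```

With `beta=0.0` the rule picks the two highest predicted activities; with `beta=2.0` it shifts to the two molecules the model is least sure about. Tuning `beta` is one way to study the exploration and exploitation balance in your own runs.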
Why This Is a Good Topic
This is a strong science fair topic because you can test a clear claim: whether an active-learning loop can recover strong hits with far fewer evaluations than a full-library screen. The project connects to drug discovery, where faster screening can save time and money. You can measure performance with metrics like enrichment, hit rate, and top-k recovery, which give you real numbers to compare model strategies.
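To show how those three metrics might be computed, here is a small sketch in plain Python. The function names are hypothetical, the scores and labels are made-up toy values, and the code assumes the library contains at least one active molecule (otherwise the enrichment factor is undefined).

```python
def hit_rate(labels_selected):
    """Fraction of selected molecules that are true actives (1 = active)."""
    return sum(labels_selected) / len(labels_selected)


def enrichment_factor(scores, labels, fraction):
    """Hit rate in the top-scored fraction divided by the whole-library hit rate."""
    n_top = max(1, int(round(fraction * len(scores))))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_labels = [labels[i] for i in order[:n_top]]
    return hit_rate(top_labels) / hit_rate(labels)


def top_k_recovery(selected_ids, true_top_k_ids):
    """Fraction of the true top-k molecules that the screen actually selected."""
    return len(set(selected_ids) & set(true_top_k_ids)) / len(true_top_k_ids)


# Toy library of ten molecules with two actives (made-up numbers).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

ef_20 = enrichment_factor(scores, labels, fraction=0.2)
recovery = top_k_recovery(selected_ids=[3, 7, 9], true_top_k_ids=[7, 9, 11, 12])
print("library hit rate:", hit_rate(labels))
print("enrichment factor at 20%:", ef_20)
print("top-k recovery:", recovery)
```

An enrichment factor above 1 means the ranked list concentrates actives better than picking at random would.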
Research Questions
- How does Bayesian optimization change the number of candidate molecules needed to recover top Mpro hits?
- What is the effect of different initial training set sizes on ChemProp ranking performance?
- Does uncertainty sampling improve hit enrichment more than random sampling on the LIT-PCBA benchmark?
- To what extent does the choice of molecular fingerprint or graph input affect active-learning performance?
- Which acquisition strategy finds lead-like molecules with the fewest total evaluations?
- How does class imbalance in the benchmark affect early-round prediction quality?
Basic Materials
- Computer with a modern CPU and at least 16 GB RAM.
- Python installed with a package manager such as conda or pip.
- Access to the LIT-PCBA benchmark dataset.
- ChemProp codebase or a ChemProp-compatible workflow.
- Spreadsheet software or a notebook tool for tracking rounds and results.
- Internet access for downloading public chemical and benchmark data.
- Digital notebook for recording model settings, seeds, and evaluation metrics.
Advanced Materials
- Workstation or server with a GPU for faster model training.
- Python environment with PyTorch, RDKit, scikit-learn, and Bayesian optimization libraries.
- ChemProp source code and configuration files for custom training runs.
- Molecular docking software for follow-up comparison, if your mentor approves it.
- High-capacity storage for repeated model checkpoints and prediction files.
- Access to literature databases such as PubMed for target background and assay context.
Software & Tools
- Python: Runs the analysis pipeline, model training, and result plots.
- ChemProp: Trains a message-passing neural network to predict molecular activity from chemical structure.
- RDKit: Parses molecules, computes descriptors, and filters lead-like compounds.
- scikit-learn: Calculates evaluation metrics and supports baseline machine learning models.
- Jupyter Notebook: Organizes experiments, code, notes, and figures in one place.
Experiment Steps
- Define the screening goal, the target metric, and the baseline you will compare against.
- Choose the molecular representation and model setup you will keep fixed across runs.
- Design the active-learning loop, including how the model will choose each new batch of molecules.
- Plan controls that separate model gain from simple random sampling luck.
- Set evaluation metrics for early enrichment, hit recovery, and ranking quality.
- Decide how you will test sensitivity to seed choice, batch size, and training set size.
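The steps above can be sketched as a small simulation before you commit to full ChemProp runs. This is a toy under loud assumptions: the library is synthetic, the "model" is a nearest-neighbour average rather than ChemProp, and the acquisition rule is plain greedy selection rather than a full Bayesian optimization rule. All function names are hypothetical. What it does illustrate is the control design: both strategies start from the same initial pool and test the same number of molecules.

```python
import random


def make_library(n, seed):
    """Synthetic library: one descriptor x per molecule, active if x > 0.95 (toy rule)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    labels = [1 if x > 0.95 else 0 for x in xs]
    return xs, labels


def predict(x, tested, xs, labels, k=5):
    """Toy surrogate: mean label of the k tested molecules nearest in descriptor space."""
    nearest = sorted(tested, key=lambda i: abs(xs[i] - x))[:k]
    return sum(labels[i] for i in nearest) / len(nearest)


def run_screen(xs, labels, initial, strategy, rounds, batch_size, seed):
    """Run one screen and return the cumulative hit count after each round."""
    rng = random.Random(seed)
    tested = list(initial)
    history = []
    for _ in range(rounds):
        tested_set = set(tested)
        pool = [i for i in range(len(xs)) if i not in tested_set]
        if strategy == "random":
            batch = rng.sample(pool, batch_size)
        else:  # "greedy": pick the highest surrogate predictions
            rng.shuffle(pool)  # shuffle first so ties are broken at random
            pool.sort(key=lambda i: predict(xs[i], tested, xs, labels), reverse=True)
            batch = pool[:batch_size]
        tested.extend(batch)  # "testing" a molecule reveals its label
        history.append(sum(labels[i] for i in tested))
    return history


xs, labels = make_library(n=500, seed=0)
initial = random.Random(1).sample(range(len(xs)), 20)  # shared starting pool

results = {}
for strategy in ("random", "greedy"):
    results[strategy] = run_screen(
        xs, labels, initial, strategy, rounds=5, batch_size=10, seed=2
    )
    print(strategy, results[strategy])
```

Repeating the comparison over several values of `seed` and several initial pools is what separates a real model gain from sampling luck.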
Common Pitfalls
- Training and testing on molecules that are too similar, which inflates performance through data leakage.
- Comparing active learning against random sampling with different starting pools, which makes the benchmark unfair.
- Using only one random seed, which hides unstable model behavior across runs.
- Judging success by accuracy alone, which can look good even when the model misses rare hits.
- Skipping chemical validity and lead-like filters, which can leave you with strong scores on molecules that would fail basic drug property checks.
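The accuracy pitfall is easy to demonstrate with arithmetic. The sketch below assumes a made-up library of 1,000 molecules with a 0.5% active rate (an illustrative figure, not a property of any real benchmark) and a useless model that labels everything inactive.

```python
def accuracy(y_true, y_pred):
    """Fraction of molecules whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true actives that the model labels as active."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / sum(y_true)


# Made-up imbalanced library: 5 actives among 1000 molecules.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000  # a model that calls every molecule inactive

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print("accuracy:", acc)
print("recall on actives:", rec)
```

The model scores 99.5% accuracy while finding none of the hits, which is why enrichment and hit recovery are better headline metrics for this project.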
What Makes This Competitive
A competitive version of this project goes past a simple model run. You compare multiple acquisition rules, multiple seeds, and strong baselines, then report confidence intervals, not just one score. You also explain where the model saves evaluations and where it still fails, especially on rare hits and imbalanced classes. If you add a careful analysis of chemical diversity or transfer to a second target, the project becomes much stronger.
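One way to report a confidence interval across seeds is a percentile bootstrap over the per-seed scores. The sketch below is an assumption about how you might do it, not a required method; `bootstrap_ci` is a hypothetical helper and the five enrichment values are made-up placeholders for your own results.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    # Resample the per-seed scores with replacement and record each mean.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up enrichment factors from five seeds of one acquisition strategy.
per_seed_ef = [4.1, 5.3, 3.8, 6.0, 4.7]

lower, upper = bootstrap_ci(per_seed_ef)
print("mean EF:", statistics.mean(per_seed_ef))
print("95% bootstrap interval:", (lower, upper))
```

With only a handful of seeds the interval will be wide, which is itself worth reporting: it tells judges how stable the speedup really is.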
Project Variations
- Try the same active-learning loop on a different LIT-PCBA target to see whether the speedup holds across proteins.
- Replace ChemProp with a fingerprint-based model and compare whether graph learning really adds value.
- Add a lead-likeness filter after ranking and test how much it changes the final hit set.
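For the filter variation, a post-ranking property check might look like the sketch below. It assumes the Lipinski rule-of-five cutoffs as a stand-in; published lead-like definitions are usually stricter, so treat the limits as placeholders to replace with whatever your mentor recommends. The molecule records are invented, and in practice the property values could come from RDKit functions such as `Descriptors.MolWt`, `Crippen.MolLogP`, `Lipinski.NumHDonors`, and `Lipinski.NumHAcceptors`.

```python
# Assumption: rule-of-five cutoffs used as a stand-in for a lead-like filter.
LIMITS = {"mol_weight": 500.0, "logp": 5.0, "h_donors": 5, "h_acceptors": 10}


def passes_filter(props, limits=LIMITS):
    """True if every listed property is at or below its cutoff."""
    return all(props[name] <= cap for name, cap in limits.items())


# Invented ranked hit list (highest model score first); values are illustrative.
ranked = [
    {"id": "mol_A", "score": 0.93, "mol_weight": 612.7, "logp": 6.1, "h_donors": 2, "h_acceptors": 8},
    {"id": "mol_B", "score": 0.90, "mol_weight": 342.4, "logp": 2.8, "h_donors": 1, "h_acceptors": 5},
    {"id": "mol_C", "score": 0.85, "mol_weight": 455.5, "logp": 4.2, "h_donors": 3, "h_acceptors": 7},
    {"id": "mol_D", "score": 0.81, "mol_weight": 298.3, "logp": 1.9, "h_donors": 6, "h_acceptors": 9},
]

kept = [m for m in ranked if passes_filter(m)]
kept_ids = [m["id"] for m in kept]
retained_fraction = len(kept) / len(ranked)
print("kept after filter:", kept_ids)
print("fraction retained:", retained_fraction)
```

Comparing the hit set before and after the filter shows how much of the model's apparent success rests on molecules that would fail basic property checks.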
Learn More
- PubChem: Search for compound records, bioassay summaries, and chemical property data for target background.
- NIH PubMed: Search review articles on virtual screening, active learning, and machine learning in drug discovery.
- RDKit Documentation: Find free documentation for molecular descriptors, fingerprints, and chemical filtering.
- ChemProp GitHub Repository: Read the open-source code and usage notes for molecular property prediction.
- PDB: Search Mpro protein structures and bound ligands for target context.
- MIT OpenCourseWare: Search for free machine learning courses that cover model evaluation, uncertainty, and optimization.
Translational Medical Science Category Guide
How to Do Real Translational Medical Science Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →
