Small-Molecule Synthesizability Classifiers

ISEF Category: Biochemistry

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: Home Setup  ·  Time: 1 to 2 Months

The Hook

A promising screening list can still fail if no one can make the molecules on it. That is why synthesizability matters. Your model can act as a reality check before anyone spends time on dead-end hits. In drug discovery, that can save weeks of work.

What Is It?

A synthesizability classifier predicts how easy a small molecule will be to make in real life. Think of it like a difficulty score for chemistry. A molecule might look great on paper, but if its structure is too tangled, unstable, or unusual, chemists may avoid it.

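SAscore and RAscore are two common ways to estimate that difficulty. SAscore rates a molecule from 1 (easy to make) to 10 (hard to make), based on how common its fragments are and how complex its structure is. RAscore is a machine learning classifier that estimates whether a retrosynthesis planning program could find a route to the molecule. A graph neural network, or GNN, is a model that reads a molecule as a graph, with atoms as nodes and bonds as edges. Instead of relying on hand-written rules, it learns structural patterns from labeled examples, then uses those patterns to judge whether a virtual-screen hit looks practical.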
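
To make the graph idea concrete, here is a minimal sketch of one message-passing step, the basic operation inside a GNN. It is an illustration under simplifying assumptions, not a trainable model: the molecule (ethanol) is encoded by hand, the three-element vocabulary is made up for the example, and there are no learned weights. In a real pipeline, a toolkit such as RDKit would build the graph from a structure file, and a GNN library would apply learned weights at each step.

```python
# Hand-encoded heavy-atom graph for ethanol (C-C-O). Illustrative only.
atoms = ["C", "C", "O"]      # node labels
bonds = [(0, 1), (1, 2)]     # undirected edges as atom-index pairs

# Tiny element vocabulary assumed for this sketch.
ELEMENTS = ["C", "N", "O"]


def one_hot(symbol):
    """Encode an element symbol as a one-hot feature vector."""
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def neighbors(n_atoms, bonds):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_pass(features, adj):
    """One round of sum aggregation: each atom adds its neighbors' features to its own."""
    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in adj[i]:
            total = [t + f for t, f in zip(total, features[j])]
        updated.append(total)
    return updated


def readout(features):
    """Sum all atom vectors into a single molecule-level vector."""
    return [sum(col) for col in zip(*features)]


features = [one_hot(a) for a in atoms]
adj = neighbors(len(atoms), bonds)
hidden = message_pass(features, adj)
mol_vector = readout(hidden)

print(hidden)      # per-atom vectors after one step
print(mol_vector)  # molecule-level vector a classifier head would consume
```

After one step, the middle carbon "knows" it sits next to a carbon and an oxygen. Stacking more steps lets each atom see farther across the molecule, which is how a GNN picks up on ring systems and other larger patterns.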

For your project, the big question is not just whether the model works, but whether it helps decision-making. You can compare top-ranked compounds before and after filtering, then see if the model pushes obvious dead ends down the list and lifts more realistic candidates up.
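
The before-and-after comparison can be sketched in a few lines. The example below is a hedged illustration: the hit names, screening scores, synthesizability scores, and the 0.5 threshold are all made up, and the score convention (0 to 1, higher means easier to make) is an assumption you would replace with whatever your model outputs.

```python
# Hypothetical virtual-screen hits: (id, screening_score, synth_score).
# Higher screening_score = better predicted hit.
# synth_score is assumed to lie in [0, 1], higher = predicted easier to make.
hits = [
    ("mol_A", 9.5, 0.15),
    ("mol_B", 9.1, 0.80),
    ("mol_C", 8.7, 0.30),
    ("mol_D", 8.2, 0.90),
    ("mol_E", 7.9, 0.75),
    ("mol_F", 7.1, 0.60),
]


def top_k(hits, k):
    """Return the ids of the k hits with the highest screening score."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [h[0] for h in ranked[:k]]


def filtered_top_k(hits, k, min_synth):
    """Drop hits below the synthesizability threshold, then take the top k."""
    kept = [h for h in hits if h[2] >= min_synth]
    return top_k(kept, k)


before = top_k(hits, 3)
after = filtered_top_k(hits, 3, min_synth=0.5)
dropped = [m for m in before if m not in after]
promoted = [m for m in after if m not in before]

print("before:", before)
print("after: ", after)
print("dropped:", dropped, "promoted:", promoted)
```

The dropped and promoted lists are the molecules worth inspecting by hand. They make good case studies for showing whether the filter removes real dead ends or throws away good candidates.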

Why This Is a Good Topic

This is a strong science fair topic because you can test it with public data and clear metrics. You do not need a wet lab to study whether the classifier improves hit triage, and you can still ask a real chemistry question: which molecules look good, but are hard to make? The project connects machine learning to drug discovery, synthetic chemistry, and practical design choices. You can learn data cleaning, feature engineering, model evaluation, and how to defend a ranking system with evidence.

Research Questions

  • How does adding a GNN to an SAscore baseline change the ranking of virtual-screen hits?
  • What is the effect of filtering top hits by synthesizability score on the fraction of chemically realistic candidates?
  • Does a model trained on public compound datasets separate easy-to-make and hard-to-make molecules better than a simple descriptor-based score?
  • To what extent do molecular size, ring count, and branching predict low synthesizability scores?
  • Which synthesizability score cutoff gives the best balance between keeping promising hits and removing impractical ones?
  • How does the model's performance change when you test it on a different chemical library than the one used for training?
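
The cutoff question above can be explored with a simple threshold sweep. This is a sketch under stated assumptions: the labeled scores are invented, "practical" is a label you would have to define for your own dataset, and Youden's J statistic (recall plus specificity minus 1) is only one of several reasonable ways to define "best balance."

```python
# Made-up (synth_score, is_practical) pairs for illustration.
labeled = [
    (0.05, False), (0.10, False), (0.20, False), (0.35, False),
    (0.45, True), (0.55, False), (0.60, True), (0.70, True), (0.85, True),
]


def sweep(labeled, cutoffs):
    """For each cutoff, report how many practical hits survive and how many impractical ones are removed."""
    n_pos = sum(1 for _, y in labeled if y)
    n_neg = len(labeled) - n_pos
    rows = []
    for c in cutoffs:
        kept_pos = sum(1 for s, y in labeled if y and s >= c)
        removed_neg = sum(1 for s, y in labeled if not y and s < c)
        recall = kept_pos / n_pos            # practical molecules kept
        specificity = removed_neg / n_neg    # impractical molecules removed
        rows.append((c, recall, specificity, recall + specificity - 1))
    return rows


rows = sweep(labeled, [0.2, 0.4, 0.6, 0.8])
best = max(rows, key=lambda r: r[3])  # highest Youden's J

for c, recall, spec, j in rows:
    print(f"cutoff={c:.1f}  recall={recall:.2f}  specificity={spec:.2f}  J={j:.2f}")
print("best cutoff by J:", best[0])
```

If losing a promising hit costs more than making one bad molecule, you could weight recall more heavily than specificity instead of treating them equally.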

Basic Materials

  • Laptop or desktop computer with at least 8 GB RAM.
  • Python installed with Jupyter Notebook access.
  • RDKit for molecule parsing and descriptor calculation.
  • Public molecule dataset from PubChem, ChEMBL, or ZINC.
  • Spreadsheet software for labeling, sorting, and checking results.
  • Access to a free cloud notebook, such as Google Colab, for faster model runs.

Advanced Materials

  • Workstation or cloud GPU access for training a GNN.
  • Curated training set of known synthesis outcomes or labeled accessibility scores.
  • Molecular graph featurization pipeline using RDKit and PyTorch Geometric or DGL.
  • Validation library that is separate from the training set.
  • Model tracking notebook with saved checkpoints and prediction exports.
  • Optional external benchmark set from a public challenge or paper.

Software & Tools

  • Python: Runs the data cleaning, feature extraction, and model training pipeline.
  • RDKit: Converts molecule files into descriptors, fingerprints, and graph inputs.
  • Google Colab: Gives you free notebook access and enough compute for small to medium experiments.
  • Jupyter Notebook: Keeps your analysis, plots, and notes in one place.
  • scikit-learn: Provides baseline classifiers, metrics, and cross-validation tools.

Experiment Steps

  1. Define the exact decision you want the model to make, such as ranking virtual hits by practical synthesizability.
  2. Collect a labeled dataset and decide how you will handle duplicate molecules, salts, and broken entries.
  3. Build a simple baseline first, then compare it with the GNN so you know whether the extra complexity helps.
  4. Choose evaluation metrics that match the goal, such as rank correlation, top-k enrichment, or balanced classification scores.
  5. Plan a holdout test set that comes from a different chemical source, so the test measures how well the model generalizes rather than how well it remembers its training library.
  6. Decide how you will explain the model's top predictions with interpretable molecular features or case studies.
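
Two of the metrics named in step 4 can be written out by hand, which helps when you need to explain them to judges. The sketch below is illustrative: the predicted scores, reference ratings, and labels are made up, and the enrichment definition used here (positive rate in the top k divided by the positive rate overall) is one common convention. Libraries such as SciPy and scikit-learn offer tested versions of rank correlation and classification metrics.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def enrichment_at_k(scores, labels, k):
    """Positive rate among the k top-scored items, divided by the overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top_rate = sum(lab for _, lab in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up example: model scores, reference ratings, and binary "practical" labels.
predicted = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
reference = [0.85, 0.6, 0.7, 0.2, 0.3, 0.05]
labels = [1, 1, 1, 0, 0, 0]

rho = spearman(predicted, reference)
ef2 = enrichment_at_k(predicted, labels, k=2)
print(f"Spearman rho = {rho:.3f}")
print(f"Enrichment at k=2 = {ef2:.1f}")
```

An enrichment above 1 means the top of the list holds more practical molecules than a random pick would. A value near 1 means the ranking is not helping.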

Common Pitfalls

  • Using a dataset with duplicate or messy molecule records, which makes the model learn noise instead of chemistry.
  • Drawing training and test compounds from the same source library, which lets near-identical molecules leak across the split and makes the results look better than they really are.
  • Treating synthesizability as a single yes-or-no label when the real signal is often a spectrum.
  • Comparing a GNN to a weak baseline, which hides whether the new model truly adds value.
  • Reporting accuracy alone, which can miss class imbalance and fail to show whether the top-ranked hits are actually more practical.

What Makes This Competitive

A strong version of this project goes past model training. You can compare multiple score types, test on a separate library, and show whether the filter changes the quality of the top-ranked hit list. The best projects also explain why the model makes its decisions, not just how well it scores. That turns a simple classifier into a real tool for screening decisions.

Project Variations

  • Test the classifier on natural product-like libraries instead of drug-like libraries.
  • Compare a GNN with fingerprint-based models, then see which one best ranks practical hits.
  • Add interpretability analysis to find which structural features most often push molecules out of the top tier.

Learn More

  • PubChem: Search compound records, structure data, and links to related literature for public molecules.
  • ChEMBL: Find bioactive compounds and curated assay data for training and validation sets.
  • RDKit documentation: Learn how to compute molecular descriptors, fingerprints, and graph features.
  • PubMed: Search review articles on synthetic accessibility scoring and molecular property prediction.
  • MIT OpenCourseWare: Find free courses on machine learning and computational chemistry topics.
  • NIH/NLM resources: Use NCBI and related databases to connect compounds, papers, and biological context.