Predicting Blood-Brain Barrier Permeability with Machine Learning

ISEF Category: Biochemistry

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Medicinal Biochemistry · Difficulty: Advanced · Setup: University Lab · Time: 1 to 2 Months

The Hook

A molecule can look promising and still fail to reach the brain. The blood-brain barrier acts like a strict security gate, so many compounds never get through. If you can predict permeability before a lab test, you save time and cut down on dead ends. That makes this project a smart mix of chemistry, biology, and machine learning.

What Is It?

The blood-brain barrier is a thin layer of tightly joined cells lining the brain's blood vessels. Think of it like a picky bouncer. It blocks many molecules, so only those with the right size, shape, and chemical makeup can pass.

Your project uses machine learning to learn those patterns from compounds with known permeability labels, then applies them to a molecule library such as COCONUT. RDKit descriptors turn each compound into numbers, such as size, polarity, and flexibility. A graph neural network skips most hand-built numbers and reads the molecule as atoms and bonds, like studying a map instead of a list of stops.
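As a minimal sketch of both views, assuming RDKit is installed, the snippet below computes three descriptors standing in for size, polarity, and flexibility, then lists the same molecule as atoms and bonds. The helper names describe and to_graph are illustrative, and the three descriptors are one possible choice rather than a recommended feature set.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def describe(smiles):
    """Return a small descriptor dict for one SMILES string, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "mol_weight": Descriptors.MolWt(mol),                   # size
        "tpsa": Descriptors.TPSA(mol),                          # polarity
        "rotatable_bonds": Descriptors.NumRotatableBonds(mol),  # flexibility
    }


def to_graph(smiles):
    """Return (atom symbols, bond index pairs) for the heavy-atom graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    return atoms, bonds


caffeine = "Cn1cnc2c1c(=O)n(C)c(=O)n2C"
print(describe(caffeine))
print(to_graph(caffeine))
```

The descriptor dict is the kind of row a baseline classifier would train on, while the atom and bond lists are the raw material a graph neural network library would convert into node and edge tensors.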

Why This Is a Good Topic

This is a strong science fair topic because you can test clear models against each other, measure real performance, and explain why the best model works. It connects to drug delivery, brain medicines, and the hard job of getting a compound to the right place in the body. You also learn data cleaning, feature design, validation, and model interpretation, which are useful skills for many research paths.

Research Questions

How does using RDKit descriptors compare with a graph neural network for blood-brain barrier permeability prediction?
What is the effect of scaffold splitting on model performance compared with random splitting?
Does adding polarity and flexibility descriptors improve predictions over size-only features?
To what extent do class-imbalance fixes, such as class weighting or resampling, change balanced accuracy and AUPRC?
Which molecular features most strongly separate predicted permeable and impermeable natural products?
What is the effect of removing close duplicates from the dataset on final model scores?

Basic Materials

Laptop with at least 8 GB RAM.
Internet access to download COCONUT and related data.
Python 3.10 or newer.
Jupyter Notebook or Google Colab.
RDKit installed in Python.
scikit-learn.
pandas and numpy.
PyTorch and PyTorch Geometric if you plan to train a graph neural network.
Spreadsheet software or a lab notebook for tracking compounds and split rules.

Advanced Materials

Linux workstation with an NVIDIA GPU and CUDA support.
Access to a high-memory compute server for model training.
A curated blood-brain barrier benchmark set with clear labels and source notes.
PyTorch Geometric or Deep Graph Library.
SHAP or integrated gradients for model interpretation.
Version-controlled storage for large datasets, checkpoints, and analysis logs.

Software & Tools

RDKit: Calculates molecular descriptors, fingerprints, and structure-cleaning steps for each compound.
Python: Runs the data cleaning, model training, and evaluation code.
Google Colab: Lets you train models without buying your own GPU.
scikit-learn: Builds baseline classifiers and compares validation strategies.
PyTorch Geometric: Trains graph neural networks on molecular graphs.

Experiment Steps

Define your target label, either binary blood-brain barrier permeability or a continuous permeability score, and decide where the labels will come from.
Clean the compound list: strip salts, then remove duplicates and near-duplicates so the same structure does not land in both training and test sets.
Build an RDKit descriptor baseline first, then use it as the yardstick for the graph neural network.
Fix your validation plan before you train anything, then compare random splits with scaffold splits to see how well the model handles new chemistry.
Select metrics that fit imbalanced data, such as AUROC, AUPRC, balanced accuracy, and MCC.
Pick an interpretation method that matches the model, then decide which chemical features or atoms you will test against the outputs.
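The split comparison in the steps above can be sketched without any chemistry library once each compound has a scaffold key. The version below is a minimal sketch that assumes you have already computed one scaffold string per compound (for example, a Bemis-Murcko scaffold SMILES from RDKit). The names scaffold_split and test_fraction are illustrative, and sending the largest scaffold groups to training first is one simple heuristic among several.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both train and test.

    scaffolds: list of scaffold keys, one per compound (assumed precomputed).
    Returns (train_indices, test_indices).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups go to train first, so rarer scaffolds land in test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_target = len(scaffolds) - round(test_fraction * len(scaffolds))

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Toy scaffold keys; real ones would be scaffold SMILES strings.
keys = ["benzene", "benzene", "benzene", "indole", "indole",
        "pyridine", "benzene", "indole", "steroid", "pyridine"]
train_idx, test_idx = scaffold_split(keys, test_fraction=0.3)
print(train_idx, test_idx)
```

Because whole scaffold groups move together, the test set can come out slightly larger or smaller than the requested fraction. Using the same index lists for the descriptor model and the graph neural network keeps the comparison about architecture rather than about the split.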

Common Pitfalls

Leaving duplicate natural products in both train and test sets, which makes the model look better than it is.
Mixing blood-brain barrier labels from different assays without checking their source, which blurs the target you are trying to learn.
Reporting only accuracy on a skewed dataset, which hides a model that guesses the majority class.
Comparing a descriptor model and a graph neural network on different split rules, which turns the architecture test into a split test.
Treating feature importance as chemistry proof without checking the underlying structures, which can turn noise into a story.
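The accuracy pitfall is easy to see with a toy calculation. The sketch below uses made-up labels (90 impermeable, 10 permeable) and a model that always predicts the majority class. The helper confusion_metrics is illustrative; in practice, scikit-learn's metrics module provides equivalent functions.

```python
import math


def confusion_metrics(y_true, y_pred):
    """Compute accuracy, balanced accuracy, and MCC for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: report MCC as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy, "mcc": mcc}


y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100
print(confusion_metrics(y_true, always_negative))
```

On this toy data the always-negative model scores 0.9 accuracy but only 0.5 balanced accuracy and 0 MCC, which is why the skew-aware metrics belong in the results table.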

What Makes This Competitive

A stronger version of this project does more than predict one label. It compares a clean descriptor baseline with a graph neural network and evaluates both on at least one external test set. Scaffold splits, calibration plots, and error analysis show whether the model generalizes to new chemistry. If you connect the predictions to known blood-brain barrier rules and explain the false positives, your project starts to look like real drug-discovery work.
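A calibration plot is built from a simple table: group predictions into probability bins, then compare the average predicted probability with the observed fraction of positives in each bin. The sketch below computes that table in plain Python on made-up numbers. The name calibration_table and the choice of equal-width bins are assumptions for illustration.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Return one (mean_predicted, observed_fraction, count) row per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(prob for _, prob in members) / len(members)
        observed = sum(label for label, _ in members) / len(members)
        rows.append((mean_predicted, observed, len(members)))
    return rows


# Made-up predictions for eight compounds (1 = permeable).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.30, 0.35, 0.55, 0.70, 0.90, 0.95]
for mean_predicted, observed, count in calibration_table(labels, probs):
    print(f"predicted {mean_predicted:.2f}  observed {observed:.2f}  n={count}")
```

Plotting observed fraction against mean predicted probability gives the calibration curve. Points near the diagonal suggest the model's probabilities can be read at face value, though bins with only a few compounds are too noisy to interpret on their own.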

Project Variations

Swap the natural-product library for approved drugs and see whether the same features still separate permeable from impermeable molecules.
Test a fingerprint-based model instead of a descriptor model, then compare whether structure patterns or hand-built descriptors help more.
Predict continuous logBB values instead of a binary label, then check whether the ranking matches the classification model.
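To compare the regression and classification variants, the continuous predictions first have to be turned into classes. The sketch below assumes a cutoff of logBB = -1 for calling a compound permeable. Treat that value as a placeholder and use whatever cutoff your label source defines. The numbers and the names logbb_to_label and agreement are made up for illustration.

```python
def logbb_to_label(logbb, cutoff=-1.0):
    """Map a continuous logBB value to 1 (permeable) or 0 (impermeable)."""
    return 1 if logbb >= cutoff else 0


def agreement(predicted_logbb, classifier_labels, cutoff=-1.0):
    """Fraction of compounds where the thresholded regression matches the classifier."""
    matches = sum(
        1
        for value, label in zip(predicted_logbb, classifier_labels)
        if logbb_to_label(value, cutoff) == label
    )
    return matches / len(classifier_labels)


# Made-up outputs for five compounds.
predicted_logbb = [0.4, -0.2, -1.3, -0.9, -2.0]
classifier_labels = [1, 1, 0, 0, 0]
print(agreement(predicted_logbb, classifier_labels))
```

Compounds where the two models disagree, such as the fourth one in this toy list, tend to sit close to the cutoff and are good candidates for error analysis.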

Learn More

PubMed: Search review articles on blood-brain barrier permeability, natural products, and cheminformatics.
PubChem: Look up compound structures, computed properties, and linked literature records.
RDKit documentation: Find descriptor, fingerprint, and molecule-cleaning functions in the official docs.
COCONUT database: Download natural-product structures and read the dataset notes on the project page.
scikit-learn User Guide: Read about cross-validation, imbalanced classification, and model metrics in the official documentation.

Biochemistry Category Guide

How to Do Real Biochemistry Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
