Drug Target Success Prediction With Knowledge Graphs
ISEF Category: Translational Medical Science
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Other · Difficulty: Advanced · Setup: Home Setup · Time: Full Year
The Hook
Most drug targets fail before they ever help a patient. That failure costs time, money, and lives. You can study that pattern with public data, then ask whether a graph model can spot better targets earlier. This kind of project turns huge biomedical databases into one testable prediction problem.
What Is It?
This project asks a simple question with a big payoff: can you use public biomedical data to predict which drug targets are more likely to make it to late-stage clinical trials? A drug target is a gene, protein, or pathway that a drug tries to affect. Phase III trials are the large, late-stage studies that usually come right before a drug is submitted for approval, so reaching that stage suggests the target survived a lot of filtering.
A knowledge graph is a network of connected facts. Think of it like a subway map for biology and medicine. One station might be a gene, another a disease, another a drug, and the lines between them describe relationships such as “is associated with,” “is targeted by,” or “has trial evidence.” A graph neural network, or GNN, is a machine learning model that learns from both the nodes and the connections between them. That makes it a good fit when the structure of the data matters, not just the labels.
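As a minimal sketch of the subway-map idea, the snippet below stores a toy knowledge graph as (subject, relation, object) triples in plain Python. The gene, disease, and drug names and the relation labels are placeholders invented for illustration, not records from any real database.

```python
from collections import defaultdict

# Hypothetical toy facts; every name here is a placeholder, not real data.
TRIPLES = [
    ("GENE_A", "is_associated_with", "DISEASE_X"),
    ("GENE_A", "is_targeted_by", "DRUG_1"),
    ("GENE_B", "is_associated_with", "DISEASE_X"),
    ("DRUG_1", "has_trial_evidence", "DISEASE_X"),
]

def build_adjacency(triples):
    """Map each node to a set of (relation, neighbor) pairs, in both directions."""
    adjacency = defaultdict(set)
    for subject, relation, obj in triples:
        adjacency[subject].add((relation, obj))
        # Store the reverse direction too, so lookups work from either end.
        adjacency[obj].add((relation + "_inverse", subject))
    return adjacency

def neighbors(adjacency, node, relation=None):
    """Return sorted neighbors of a node, optionally restricted to one relation."""
    return sorted(
        other for rel, other in adjacency.get(node, set())
        if relation is None or rel == relation
    )

graph = build_adjacency(TRIPLES)
print(neighbors(graph, "GENE_A"))
print(neighbors(graph, "DISEASE_X", "is_associated_with_inverse"))
```

A library such as NetworkX or PyTorch Geometric would hold the same structure in a real project; the plain dictionary version only shows how nodes and typed connections fit together.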
You would build a graph from sources such as ClinVar, OpenTargets, ClinicalTrials.gov, and the FDA Orange Book, then test whether the model can learn patterns from older data and predict later outcomes. The hard part is not just training the model. The hard part is making sure your data cleaning, label definitions, and validation plan are honest and reproducible.
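One way to picture the data-cleaning part is a small alias table that maps every name a source uses onto one canonical identifier before any edges are built. The alias table, the identifiers, and the `canonicalize` and `merge_edges` helpers below are all hypothetical; real mappings would come from the cross-reference files each database publishes.

```python
# Hypothetical alias table: source-specific names -> one canonical ID.
# Real projects would load this from official cross-reference files.
ALIASES = {
    "gene_a": "TARGET:0001",
    "genea": "TARGET:0001",
    "protein a": "TARGET:0001",
    "gene_b": "TARGET:0002",
}

def canonicalize(name, aliases=ALIASES):
    """Return the canonical ID for a raw name, or None if it is unmapped."""
    key = " ".join(name.strip().lower().split())  # trim, lowercase, collapse spaces
    return aliases.get(key)

def merge_edges(raw_edges):
    """Canonicalize target names; return deduplicated edges and unmapped names."""
    edges, unmapped = set(), set()
    for raw_target, disease in raw_edges:
        target_id = canonicalize(raw_target)
        if target_id is None:
            unmapped.add(raw_target)  # audit these by hand instead of guessing
        else:
            edges.add((target_id, disease))
    return sorted(edges), sorted(unmapped)

raw = [("Gene_A", "DISEASE_X"), ("  Protein  A ", "DISEASE_X"), ("Gene_C", "DISEASE_Y")]
edges, unmapped = merge_edges(raw)
print(edges)     # a single edge for TARGET:0001, although two spellings appeared
print(unmapped)  # names that need manual review
```

Keeping the unmapped names in a separate list, rather than dropping them silently, makes the cleaning step easier to audit and reproduce.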
Why This Is a Good Topic
This is a strong science fair topic because it has a clear yes-or-no outcome, uses public data, and connects directly to real drug development decisions. You can test whether network structure adds predictive power beyond simple counts or popularity measures. You also get to learn skills that matter in modern biomedical research, including data wrangling, feature engineering, graph thinking, and retrospective validation. That makes the project both technical and meaningful.
Research Questions
- How does a graph neural network compare with a random forest in predicting whether a preclinical target reaches Phase III within 5 years?
- What is the effect of adding ClinicalTrials.gov links on prediction accuracy compared with using target-disease associations alone?
- Does including FDA Orange Book drug evidence improve model performance for targets already linked to approved therapies?
- To what extent do disease area, target family, and prior human evidence change the probability of late-stage success?
- Which network features, such as degree, centrality, or shared disease neighbors, are most associated with Phase III progression?
- How does a model trained on 2015 to 2017 data perform when tested on 2018 to 2020 targets?
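Two of the network features named in these questions, degree and shared disease neighbors, can be computed on a small scale without any special library. This sketch assumes a made-up target-disease edge list and uses Jaccard overlap as one possible way to measure shared neighbors; other definitions would be just as reasonable.

```python
from collections import defaultdict

# Hypothetical target-disease links; placeholders, not real associations.
EDGES = [
    ("T1", "D1"), ("T1", "D2"), ("T1", "D3"),
    ("T2", "D2"), ("T2", "D3"),
    ("T3", "D4"),
]

def disease_neighbors(edges):
    """Map each target to the set of diseases it is linked to."""
    table = defaultdict(set)
    for target, disease in edges:
        table[target].add(disease)
    return dict(table)

def degree(table, target):
    """Number of distinct diseases linked to a target."""
    return len(table.get(target, set()))

def shared_neighbor_jaccard(table, a, b):
    """Shared diseases divided by all diseases linked to either target."""
    set_a, set_b = table.get(a, set()), table.get(b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

table = disease_neighbors(EDGES)
print(degree(table, "T1"))                         # T1 links to three diseases
print(shared_neighbor_jaccard(table, "T1", "T2"))  # two shared out of three total
```

Features like these can feed a baseline such as a random forest, which gives the graph neural network something concrete to beat.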
Basic Materials
- Laptop with at least 16 GB RAM, if possible.
- Spreadsheet software for quick inspection of tables.
- Python installed with pandas, scikit-learn, networkx, and PyTorch Geometric or DGL.
- API access or downloadable files from ClinVar, OpenTargets, ClinicalTrials.gov, and the FDA Orange Book.
- External hard drive or cloud storage for versioned data backups.
- Text editor or notebook environment such as JupyterLab or VS Code.
- Reference manager such as Zotero for tracking sources.
Advanced Materials
- Workstation or university cluster access for larger graph training runs.
- Python environment with PyTorch, PyTorch Geometric or DGL, pandas, scikit-learn, and statsmodels.
- SQL database or graph database such as Neo4j for structured integration.
- Access to bulk downloads or APIs from ClinVar, OpenTargets, ClinicalTrials.gov, the FDA Orange Book, and PubMed metadata.
- GPU access for faster GNN training and hyperparameter testing.
- Reproducible workflow tools such as Snakemake or Nextflow.
- ImageJ, only if you add pathway or figure-based annotation workflows.
Software & Tools
- Python: Cleans the biomedical tables, merges identifiers, and builds features from the graph.
- JupyterLab: Lets you explore data, test assumptions, and document each modeling step.
- pandas: Helps you join, filter, and audit large tabular datasets.
- NetworkX: Builds the knowledge graph and computes network measures.
- PyTorch Geometric: Trains graph neural networks on nodes, edges, and target labels.
- scikit-learn: Runs baseline models, cross-validation, and evaluation metrics.
Experiment Steps
- Define your prediction target, such as whether a preclinical target reaches Phase III within a fixed time window.
- Select your data sources and decide which entity types and relationship types belong in the graph.
- Standardize identifiers so one gene, disease, or drug does not appear under multiple names.
- Build a baseline model first, then compare it with a graph-based model to test whether network structure helps.
- Design a time-split validation plan so your model only learns from information available before the prediction date.
- Choose evaluation metrics that match the class imbalance, then inspect the false positives and false negatives for patterns.
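The label definition and time-split steps above might look like the following sketch. The record fields (`first_seen`, `phase3_date`), the 2018 cutoff, and the 5-year window are illustrative assumptions, and every date is invented.

```python
from datetime import date

# Hypothetical target records; the field names and dates are made up.
RECORDS = [
    {"target": "T1", "first_seen": date(2015, 3, 1), "phase3_date": date(2018, 6, 1)},
    {"target": "T2", "first_seen": date(2016, 5, 1), "phase3_date": None},
    {"target": "T3", "first_seen": date(2018, 2, 1), "phase3_date": date(2024, 9, 1)},
    {"target": "T4", "first_seen": date(2019, 7, 1), "phase3_date": date(2021, 1, 1)},
]

def label(record, window_years=5):
    """1 if the target reached Phase III within the window after first_seen, else 0."""
    reached = record["phase3_date"]
    if reached is None:
        return 0
    start = record["first_seen"]
    # Note: a February 29 start date would need special handling here.
    deadline = start.replace(year=start.year + window_years)
    return int(reached <= deadline)

def time_split(records, cutoff):
    """Targets first seen before the cutoff train the model; later ones test it."""
    train = [r for r in records if r["first_seen"] < cutoff]
    test = [r for r in records if r["first_seen"] >= cutoff]
    return train, test

train, test = time_split(RECORDS, date(2018, 1, 1))
print([(r["target"], label(r)) for r in train])
print([(r["target"], label(r)) for r in test])
```

This sketch only splits targets by entry date. A stricter version would also rebuild the graph features using only edges dated before the prediction date, so nothing from the future leaks into training.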
Common Pitfalls
- Mixing identifiers from different databases without careful mapping, which creates duplicate genes, drugs, or diseases in the graph.
- Using future information in training data, which makes the model look stronger than it really is.
- Evaluating the rare Phase III class with simple accuracy, which hides poor performance on the targets you care about.
- Letting one popular disease area dominate the graph, which can make the model learn which research areas are heavily studied instead of which biology is promising.
- Skipping baseline models, which makes it impossible to tell whether the GNN adds real value.
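The accuracy pitfall is easy to demonstrate with a few invented numbers. In this sketch, a "model" that predicts failure for every target still scores 80% accuracy, while average precision, computed here by hand as the mean precision at each rank where a true success appears, looks at how well the rare successes are ranked. The labels and scores are made up, and the hand-written function ignores tied scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_precision(y_true, scores):
    """Mean of precision at each rank where a true positive appears (no tie handling)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, truth) in enumerate(ranked, start=1):
        if truth == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical labels: 2 successes among 10 targets.
y_true = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# A useless predictor that always says "fails" still looks accurate.
always_negative = [0] * len(y_true)
print(accuracy(y_true, always_negative))

# Hypothetical model scores; the two successes land at ranks 2 and 4.
scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.05, 0.15, 0.35, 0.25]
print(average_precision(y_true, scores))
```

In practice, scikit-learn's `average_precision_score` covers this kind of metric; the manual version is only meant to show what the number measures.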
What Makes This Competitive
A competitive version of this project does more than train a model. It tests the model against strong baselines, uses a strict time-based split, and checks whether gains still hold after controlling for simple predictors like target count or disease count. You can also make the project stronger by comparing different graph designs, such as adding evidence edges versus only using target-disease links. The best entries usually explain not just what predicts success, but why the prediction might be trustworthy.
Project Variations
- Predict which drug targets reach Phase II instead of Phase III, then compare whether the task is easier earlier in development.
- Focus on one disease area, such as cancer or rare diseases, and test whether graph signals work better in a narrower setting.
- Replace the GNN with a simpler graph score model, then measure how much deep learning really adds.
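For the last variation, one possible "simpler graph score" is neighbor voting: score a candidate target by the success rate of already-labeled targets that share a disease with it. This is only one illustrative choice, not an established method from the sources above, and the edges, labels, and `neighbor_vote_score` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical training data: target-disease links and known outcomes.
EDGES = [
    ("T1", "D1"), ("T2", "D1"), ("T3", "D1"),
    ("T3", "D2"), ("T4", "D2"), ("T5", "D3"),
]
KNOWN_SUCCESS = {"T1": 1, "T2": 0, "T4": 1}  # labels available before the cutoff

def neighbor_vote_score(edges, labels, candidate):
    """Success rate among labeled targets that share a disease with the candidate."""
    by_disease = defaultdict(set)
    diseases_of = defaultdict(set)
    for target, disease in edges:
        by_disease[disease].add(target)
        diseases_of[target].add(disease)
    # Collect every target that shares at least one disease with the candidate.
    peers = set()
    for disease in diseases_of[candidate]:
        peers |= by_disease[disease]
    peers.discard(candidate)
    labeled = [labels[p] for p in peers if p in labels]
    # None signals "no labeled neighbors", which is different from a score of 0.
    return sum(labeled) / len(labeled) if labeled else None

print(neighbor_vote_score(EDGES, KNOWN_SUCCESS, "T3"))  # peers T1, T2, T4
print(neighbor_vote_score(EDGES, KNOWN_SUCCESS, "T5"))  # no labeled peers
```

If a score this simple performs close to the GNN on a time-split test set, that is a finding worth reporting in its own right.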
Learn More
- ClinVar: Search the NCBI ClinVar database for variant and gene-disease relationships that can seed parts of your graph.
- Open Targets Platform: Find target-disease evidence summaries and downloadable data for biomedical network building.
- ClinicalTrials.gov: Search study records and download trial metadata to label development progress.
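- FDA Orange Book: Use the FDA listing of approved drug products to identify approved drugs. The Orange Book does not list drug targets, so map each drug to its target through another source such as Open Targets.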
- MIT OpenCourseWare: Search for free courses on machine learning, graph theory, and biomedical data science.
- PubMed: Search for review articles on target validation, drug development pipelines, and graph neural networks in biomedicine.
Translational Medical Science Category Guide
How to Do Real Translational Medical Science Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →
