Clinical Note Text Mining for Drug Effects

ISEF Category: Translational Medical Science

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A hospital note can hide clues that a drug is doing more than its label says. If you can spot those clues in text, you can turn messy records into testable hypotheses. That is the core of reverse translation, starting from patient care and working back to biology.

What Is It?

This project asks you to read clinical notes the way a detective reads a case file. You look for drug names, symptom words, and patterns that show up together more often than chance. Then you map the drug to known molecular targets in ChEMBL, a database that links compounds to proteins they act on.

Think of it like building a bridge. One side is the bedside, where doctors write free-text notes about what happened to a patient. The other side is the bench, where scientists ask why a drug causes a certain effect. Your job is to connect the two with text mining, which means using computers to pull useful information out of text, and then ranking the strongest mechanistic guesses for later testing.
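To make the text-mining idea concrete, here is a minimal sketch of dictionary matching and co-occurrence counting. The notes, the two tiny vocabularies, and the helper names find_terms and count_pairs are all invented for illustration; a real project would read approved clinical notes and use a curated vocabulary.

```python
import re
from collections import Counter
from itertools import product

# Toy, invented notes. Real notes would come from an approved dataset.
NOTES = [
    "Patient started on metoprolol. Reports dizziness and fatigue.",
    "Continued metoprolol. Dizziness improved.",
    "Started amoxicillin for infection. Developed rash.",
    "No medications changed. Complains of fatigue.",
]

# Hypothetical mini-dictionaries, far smaller than a real vocabulary.
DRUGS = {"metoprolol", "amoxicillin"}
SYMPTOMS = {"dizziness", "fatigue", "rash"}


def find_terms(text, vocabulary):
    """Return the vocabulary terms that appear as whole words in the text."""
    found = set()
    for term in vocabulary:
        # \b word boundaries prevent matches inside longer, unrelated words.
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            found.add(term)
    return found


def count_pairs(notes):
    """Count how many notes mention each (drug, symptom) pair together."""
    pair_counts = Counter()
    for note in notes:
        drugs = find_terms(note, DRUGS)
        symptoms = find_terms(note, SYMPTOMS)
        pair_counts.update(product(sorted(drugs), sorted(symptoms)))
    return pair_counts


pair_counts = count_pairs(NOTES)
```

Here pair_counts[("metoprolol", "dizziness")] is 2, because two of the toy notes mention both terms. Raw counts like this are only a starting point: they ignore negation and say nothing yet about whether the pair shows up more often than chance.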

Why This Is a Good Topic

This is a strong science fair topic because it gives you a real data science problem with a clear medical payoff. You can test whether a drug-symptom pair appears more often than expected, compare different text filters, and check whether the signal points to a plausible target class. You also learn how clinical language, database mining, and hypothesis generation fit together, which is close to how modern translational research works.
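One common way to test whether a drug-symptom pair appears more often than expected is a 2x2 table with an odds ratio and Fisher's exact test. The sketch below is one possible approach, not the only valid one. It assumes a counts notes with both the drug and the symptom, b notes with the drug only, c notes with the symptom only, and d notes with neither; the function names are made up for this example.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table. Adds 0.5 to every cell (the
    Haldane-Anscombe correction) so a zero cell cannot divide by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value: the chance of seeing at least `a`
    notes with both drug and symptom if the two were unrelated."""
    drug_notes = a + b        # first row total
    symptom_notes = a + c     # first column total
    n = a + b + c + d         # all notes
    largest_a = min(drug_notes, symptom_notes)
    favorable = 0
    for k in range(a, largest_a + 1):
        # Ways to place k symptom notes among drug notes and the rest elsewhere.
        favorable += comb(drug_notes, k) * comb(n - drug_notes, symptom_notes - k)
    return favorable / comb(n, symptom_notes)
```

With a = 3, b = 1, c = 1, d = 3, the one-sided p-value is 17/70, about 0.24, so eight notes are far too few to claim a signal. In practice you would likely call scipy.stats.fisher_exact with alternative="greater" rather than hand-rolling the test; the pure-Python version is here only to show what the test computes.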

Research Questions

How does the choice of symptom dictionary change the number of drug-symptom pairs you detect?
What is the effect of restricting analysis to discharge summaries versus all note types?
Does adding negation detection reduce false drug-symptom associations?
To what extent do high-frequency drugs produce more spurious associations than rare drugs?
Which drug-symptom pairs remain significant after controlling for multiple testing?
How does mapping candidate drugs to ChEMBL targets change the plausibility ranking of your top associations?

Basic Materials

Computer with enough memory to handle large text files.
Access to MIMIC-IV, including the MIMIC-IV-Note module that holds the free-text notes, after required training and data use approval.
Python installed with pandas, numpy, scipy, and spaCy.
Basic symptom dictionary from a public biomedical vocabulary such as SNOMED CT terms, UMLS-derived lists, or a curated symptom list.
Spreadsheet software for quick inspection of results.
Reference manager or note-taking app for tracking papers and code decisions.

Advanced Materials

Access to a secured university or hospital research environment for MIMIC-IV.
Python environment with scikit-learn, statsmodels, and spaCy biomedical models.
ChEMBL target annotation files or API access for target mapping.
Natural language processing tools for negation, section splitting, and entity linking.
High-memory workstation or server for batch processing large note collections.
Version control system such as Git for code tracking and reproducibility.

Software & Tools

Python: Processes note text, counts associations, and runs statistical tests.
spaCy: Tokenizes notes, finds drug and symptom mentions through dictionary matchers or biomedical models, and supports custom text cleaning.
pandas: Organizes clinical note outputs into tables for analysis.
statsmodels: Runs significance tests and multiple-comparison correction.
PubChem: Helps confirm drug identities and normalize names before target mapping.

Experiment Steps

Define one narrow clinical outcome class, one drug group, and one note type so your question stays testable.
Build a text pipeline that extracts drug mentions, symptom mentions, and negation cues from the notes.
Create a comparison rule, such as a 2x2 table of notes with and without the drug and with and without the symptom, so you can measure whether a drug-symptom pair appears more often than expected.
Add a database mapping step that links each drug to known ChEMBL targets and groups those targets by pathway or protein family.
Rank your findings by effect size, statistical significance, and biological plausibility, then stress-test them with alternative filters.
Write down the cases that fail your pipeline, because those failures will tell you where the method breaks.
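The negation part of the pipeline can be sketched with a simple rule: look a few words back from the symptom for a negation cue. This is a rough, NegEx-style illustration, not a validated tool. The cue list, the WINDOW size, and the function name symptom_status are assumptions made for this example, and the function expects one sentence and a single-word symptom.

```python
import re

# A small, illustrative cue list. Published negation tools use far larger ones.
NEGATION_CUES = {"no", "not", "denies", "denied", "without", "negative"}
WINDOW = 4  # assumed number of words to look back from the symptom


def symptom_status(sentence, symptom):
    """Return 'absent' if the symptom is not mentioned, 'negated' if a cue
    appears within WINDOW words before it, and 'affirmed' otherwise."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if symptom not in tokens:
        return "absent"
    idx = tokens.index(symptom)  # first mention only
    preceding = tokens[max(0, idx - WINDOW):idx]
    if any(tok in NEGATION_CUES for tok in preceding):
        return "negated"
    return "affirmed"
```

For example, symptom_status("No nausea or vomiting.", "nausea") returns "negated", while symptom_status("Patient reports nausea after dose.", "nausea") returns "affirmed". The rule also fails in instructive ways: in "Patient denies chest pain, nausea, or dizziness." the cue sits more than four words before "dizziness", so that symptom is wrongly returned as "affirmed". Cases like this belong in your log of pipeline failures.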

Common Pitfalls

Matching drug names too loosely, which confuses brand names, abbreviations, and unrelated words.
Ignoring negation, which makes phrases like "no nausea" look like evidence for nausea.
Mixing note sections, which lets problem lists, family history, and current symptoms blur together.
Treating every association as biological, which inflates weak text signals into fake mechanisms.
Skipping multiple-testing correction, which makes a long list of random pairs look meaningful.
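To see why multiple-testing correction matters, here is a minimal sketch of the Benjamini-Hochberg procedure, one widely used way to control the false discovery rate. It is shown in pure Python for clarity; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected
    while controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Invented p-values for eight drug-symptom pairs.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
kept = benjamini_hochberg(example)
```

In this invented example, five of the eight p-values fall below 0.05, but only the first two survive the correction. For a real analysis, statsmodels offers the same procedure through statsmodels.stats.multitest.multipletests with method="fdr_bh".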

What Makes This Competitive

A strong version of this project goes beyond counting word pairs. You would compare multiple extraction rules, report effect sizes, and test whether your findings survive stricter filters. You could also connect the text signal to a target family, then ask whether the known biology of that target offers a plausible mechanism for the symptom pattern. That kind of layered analysis looks much stronger than a simple frequency chart.

Project Variations

Focus on one drug class, such as antibiotics or antidepressants, and compare its symptom profile across note types.
Restrict the analysis to discharge summaries only, then test whether the signal gets cleaner or weaker than it is across all note types.
Add target-family analysis, such as GPCRs or ion channels, to see whether certain mechanisms cluster around certain symptoms.
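The target-family variation can be sketched as a simple tally: for each symptom, count how many of its associated drugs act on each family. Everything in the example below is a placeholder, including the drug names, the family assignments, and the list of significant pairs. None of it is real ChEMBL data; in a real project the mapping would come from ChEMBL target annotations.

```python
from collections import defaultdict

# Placeholder annotations, NOT real ChEMBL data.
DRUG_TO_FAMILIES = {
    "drug_a": ["GPCR"],
    "drug_b": ["GPCR", "ion channel"],
    "drug_c": ["ion channel"],
    "drug_d": ["kinase"],
}

# Placeholder list of drug-symptom pairs that passed your statistical filters.
SIGNIFICANT_PAIRS = [
    ("drug_a", "dizziness"),
    ("drug_b", "dizziness"),
    ("drug_c", "palpitations"),
    ("drug_d", "rash"),
]


def families_by_symptom(pairs, drug_to_families):
    """For each symptom, count how many associated drugs hit each target family."""
    table = defaultdict(lambda: defaultdict(int))
    for drug, symptom in pairs:
        # Drugs with no annotation contribute nothing rather than raising an error.
        for family in drug_to_families.get(drug, []):
            table[symptom][family] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {symptom: dict(counts) for symptom, counts in table.items()}


summary = families_by_symptom(SIGNIFICANT_PAIRS, DRUG_TO_FAMILIES)
```

In this toy case, summary["dizziness"] is {"GPCR": 2, "ion channel": 1}. A cluster like that is only a hypothesis to rank for later testing, since drugs in the same family are often prescribed to similar patients for similar reasons.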

Learn More

MIMIC-IV: Search the official PhysioNet project page and data documentation for dataset structure, access rules, and sample tables.
PhysioNet: A nonprofit repository that hosts MIMIC-IV and other clinical datasets, with documentation and tutorials.
ChEMBL: Search the ChEMBL database and downloads page for drug-target relationship files and target annotations.
NIH PubMed: Search for review articles on clinical text mining, adverse drug event detection, and pharmacovigilance.
MIT OpenCourseWare: Look for free courses on data science, machine learning, and biomedical informatics basics.

