Long COVID Drug Repurposing With Text Mining
ISEF Category: Translational Medical Science
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Disease Treatment and Therapies · Difficulty: Advanced · Setup: Home Setup · Time: Full Year
The Hook
Some medicines already carry clues about what they might treat next. Their side effects, targets, and trial histories can act like a hidden map. Your job is to read that map with text mining and rank drugs that might fit long COVID symptom clusters. That is a real research problem, not a guessing game.
What Is It?
This project asks you to scan public biomedical text and pull out patterns that might point to drug repurposing. Repurposing means finding a new use for a drug that is already approved for a different condition. You are not inventing a new molecule. You are looking for overlap between what drugs do, what side effects they cause, and what long COVID patients report.
Think of it like matching keys to locks. A drug has known targets, which are proteins or pathways it affects. Long COVID has symptom clusters, like fatigue, brain fog, sleep problems, or heart symptoms. If a drug touches pathways linked to a symptom cluster, it may deserve a closer look.
You can build a pipeline that reads ClinicalTrials.gov records and PubMed abstracts, extracts drug names, adverse events, and symptom terms, then ranks candidate drugs. A stronger version adds ChEMBL target-pathway data. That lets you compare the biology behind the drug with the biology behind the condition.
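As a rough illustration of the extraction step described above, the sketch below counts how often a drug term and a symptom term appear in the same document. The term lists, the placeholder names `drug_a` and `drug_b`, and the sample sentences are all invented for the example. They are not real findings, and a real project would load abstracts or trial records instead.

```python
import re
from collections import Counter

# Hypothetical term lists; a real project would build these from a curated vocabulary.
DRUG_TERMS = {"drug_a", "drug_b"}
SYMPTOM_TERMS = {"fatigue", "brain fog", "tachycardia"}


def find_terms(text, terms):
    """Return the terms that appear in text as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(term)
    return found


def count_cooccurrences(documents):
    """Count documents in which a drug term and a symptom term both appear."""
    counts = Counter()
    for doc in documents:
        drugs = find_terms(doc, DRUG_TERMS)
        symptoms = find_terms(doc, SYMPTOM_TERMS)
        for drug in drugs:
            for symptom in symptoms:
                counts[(drug, symptom)] += 1
    return counts


# Made-up placeholder sentences, used only to exercise the functions.
SAMPLE_DOCS = [
    "Drug_A was evaluated for fatigue and brain fog.",
    "Drug_B reduced tachycardia in a hypothetical study.",
    "Fatigue was the most reported symptom; drug_a was well tolerated.",
]

if __name__ == "__main__":
    for pair, n in count_cooccurrences(SAMPLE_DOCS).most_common():
        print(pair, n)
```

Co-occurrence is only a starting signal. It cannot tell whether the text says a drug helps a symptom, causes it, or merely mentions both, so the counts should feed a later scoring and review step.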
Why This Is a Good Topic
This is a strong science fair topic because you can test a clear question with public data and real-world medical relevance. You do not need a wet lab, but you still get to do original work by building your own ranking method, feature set, and validation plan. The project also connects to a real medical need, since long COVID still lacks a clear treatment path. You can learn text mining, biomedical database use, and basic evidence scoring.
Research Questions
- How does including adverse-event overlap change the rank order of candidate repurposed drugs for long COVID?
- What is the effect of using ClinicalTrials.gov records versus PubMed abstracts on the number of candidate drugs found?
- Does adding ChEMBL target-pathway matching improve agreement between ranked candidates and published long COVID hypotheses?
- To what extent do symptom clusters such as fatigue, brain fog, and dysautonomia map to different drug-target pathways?
- Which text-mining approach, keyword matching or LLM-assisted extraction, finds more precise drug-symptom links?
- How does excluding ambiguous drug names affect the stability of the candidate ranking?
Basic Materials
- Laptop or desktop computer with internet access.
- Free PubMed access through the NIH website.
- ClinicalTrials.gov database access through the NIH website.
- ChEMBL public database access.
- Spreadsheet software such as Google Sheets or LibreOffice Calc.
- Python installed locally or in a notebook environment.
- Basic text file editor for cleaning search terms.
- Reference manager such as Zotero for tracking sources.
Advanced Materials
- Laptop or desktop computer with internet access.
- Python with pandas, scikit-learn, spaCy, and a biomedical LLM workflow.
- PubMed abstracts export tool or API access.
- ClinicalTrials.gov API access.
- ChEMBL API or downloadable dataset.
- Network analysis software such as Cytoscape.
- Statistical analysis package such as R or Python statsmodels.
- Optional access to UMLS or other controlled vocabulary resources for synonym mapping.
Software & Tools
- Python: Cleans text, extracts drug and symptom terms, and runs the ranking pipeline.
- pandas: Organizes trial records, abstract data, and candidate scores.
- spaCy: Helps with named-entity extraction and text preprocessing.
- PubMed search tools: Finds review articles and abstracts on long COVID, adverse events, and repurposing methods.
- ClinicalTrials.gov API: Pulls trial records for drug names, conditions, and outcomes.
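To make the data-access tools above more concrete, here is a small sketch that builds search URLs for the two public sources without sending any requests. The endpoint paths and parameter names (`db`, `term`, `retmode`, `retmax` for NCBI E-utilities; `query.cond` and `pageSize` for the ClinicalTrials.gov v2 API) reflect the public documentation as best understood here and should be checked against the official docs before use. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoints as assumed from public documentation; verify before relying on them.
PUBMED_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
CTGOV_STUDIES = "https://clinicaltrials.gov/api/v2/studies"


def pubmed_search_url(term, retmax=20):
    """Build a PubMed search URL that asks for JSON results."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return PUBMED_ESEARCH + "?" + urlencode(params)


def ctgov_search_url(condition, page_size=20):
    """Build a ClinicalTrials.gov search URL filtered by condition."""
    params = {"query.cond": condition, "pageSize": page_size}
    return CTGOV_STUDIES + "?" + urlencode(params)


if __name__ == "__main__":
    print(pubmed_search_url("long covid fatigue", retmax=5))
    print(ctgov_search_url("long covid", page_size=10))
```

Keeping URL construction separate from the download step makes it easy to log every query you ran, which helps when you document your search strategy.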
Experiment Steps
- Define the exact long COVID symptom clusters you will track, then decide how you will translate them into searchable terms.
- Choose your source set, such as ClinicalTrials.gov records, PubMed abstracts, or both, and write down what each source can and cannot tell you.
- Build a candidate extraction plan that identifies drugs, adverse events, and target-pathway terms in a consistent way.
- Design a scoring rule that rewards inverse symptom overlap, where a drug's known effects run opposite to a cluster's symptoms rather than mimicking them, and biologically plausible target-pathway matches.
- Plan a validation set using known long COVID hypotheses, related post-viral syndromes, or published clinical signals.
- Decide how you will compare ranking methods, then test whether your method improves precision, stability, or agreement with expert-curated evidence.
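The scoring and validation steps above could be sketched as a weighted sum of evidence features followed by a precision-at-k check against a validation set. The feature names (`inverse_overlap`, `pathway_match`), the weights, and the placeholder drugs are assumptions made for illustration, not a recommended scheme.

```python
def score_candidate(features, weights):
    """Weighted sum of evidence features; missing features count as zero."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def rank_candidates(candidates, weights):
    """Return (name, score) pairs, highest score first, ties broken by name."""
    scored = [(name, score_candidate(f, weights)) for name, f in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def precision_at_k(ranked_names, validation_set, k):
    """Fraction of the top k ranked names that appear in the validation set."""
    top = ranked_names[:k]
    if not top:
        return 0.0
    return sum(1 for name in top if name in validation_set) / len(top)


if __name__ == "__main__":
    # Hypothetical weights and feature values, invented for the example.
    weights = {"inverse_overlap": 1.0, "pathway_match": 2.0}
    candidates = {
        "drug_a": {"inverse_overlap": 2, "pathway_match": 1},
        "drug_b": {"inverse_overlap": 3, "pathway_match": 0},
        "drug_c": {"inverse_overlap": 0, "pathway_match": 0.5},
    }
    ranking = rank_candidates(candidates, weights)
    names = [name for name, _ in ranking]
    print(ranking)
    print(precision_at_k(names, {"drug_a", "drug_c"}, k=2))  # 0.5
```

Writing the weights down as explicit parameters lets you rerun the ranking with and without a feature, which is one way to test whether pathway matching adds value.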
Common Pitfalls
- Treating every symptom mention as a true adverse event signal, which inflates false matches between drugs and long COVID clusters.
- Failing to normalize drug names, which splits one medicine into several duplicate entries.
- Mixing up disease symptoms with treatment side effects, which makes the overlap score meaningless.
- Using raw LLM outputs without a rule for checking hallucinated drug or pathway claims.
- Ignoring study design bias in trial records and abstracts, which can make common drugs look stronger than they are.
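Two of the pitfalls above, duplicate drug names and unchecked LLM output, lend themselves to simple guards. The sketch below assumes a small hand-made synonym table and a verbatim-match rule for extracted terms. Both are illustrative choices, and a real pipeline would likely draw synonyms from a controlled vocabulary.

```python
import re

# Hypothetical synonym table mapping brand or regional names to one canonical name.
SYNONYMS = {
    "tylenol": "acetaminophen",
    "paracetamol": "acetaminophen",
    "advil": "ibuprofen",
}


def normalize_drug_name(raw):
    """Lowercase, collapse whitespace, and map known synonyms to one name."""
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    return SYNONYMS.get(cleaned, cleaned)


def merge_counts(raw_counts):
    """Combine mention counts that refer to the same normalized drug."""
    merged = {}
    for raw, count in raw_counts.items():
        key = normalize_drug_name(raw)
        merged[key] = merged.get(key, 0) + count
    return merged


def is_supported(extracted_term, source_text):
    """True only if the extracted term appears in the source text as a whole word."""
    pattern = r"\b" + re.escape(extracted_term.lower()) + r"\b"
    return re.search(pattern, source_text.lower()) is not None


if __name__ == "__main__":
    print(merge_counts({"Tylenol": 2, "paracetamol ": 3, "Acetaminophen": 1}))
    text = "Patients received Ibuprofen daily."
    print(is_supported("ibuprofen", text), is_supported("metformin", text))
```

The verbatim check is deliberately strict. It will reject correct extractions that use a synonym absent from the text, so it works best after normalization or alongside manual review of rejected terms.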
What Makes This Competitive
A strong version of this project does more than count word matches. It defines a clean scoring system, compares at least two extraction methods, and checks whether the ranking stays stable when you change the source data. It also tests whether pathway matching adds value beyond simple symptom overlap. If you can validate your pipeline against a known set of published candidates or expert review signals, your project looks much more like real translational research.
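One simple way to check whether a ranking stays stable across source data is to compare the top-k candidates from two runs using Jaccard similarity. This is only one of several possible stability measures, and the rankings below are placeholders rather than results.

```python
def top_k_jaccard(ranking_a, ranking_b, k):
    """Jaccard similarity between the top-k items of two ranked lists."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        return 1.0  # two empty rankings are treated as identical
    return len(top_a & top_b) / len(union)


if __name__ == "__main__":
    # Placeholder rankings, e.g. one from trial records and one from abstracts.
    from_trials = ["drug_a", "drug_b", "drug_c", "drug_d"]
    from_abstracts = ["drug_b", "drug_a", "drug_e", "drug_c"]
    print(top_k_jaccard(from_trials, from_abstracts, k=3))  # 0.5
```

A value near 1 means the two sources agree on which drugs belong in the top k, while a value near 0 means the shortlist depends heavily on the source. Because Jaccard ignores order within the top k, a rank correlation measure would be a useful complement.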
Project Variations
- Focus only on fatigue and brain fog, then compare candidate drugs across neurologic pathway terms.
- Use only ClinicalTrials.gov records and test whether trial outcomes predict which repurposed drugs look promising.
- Replace long COVID with another post-viral syndrome, then see whether the same ranking method finds similar drug classes.
Learn More
- PubMed: Search review articles on long COVID, drug repurposing, and text mining to build your background section.
- ClinicalTrials.gov: Search the public trial registry to learn how studies are described and how intervention names appear.
- ChEMBL: Explore drug targets and bioactivity records for approved compounds and target-pathway links.
- NIH RECOVER Initiative: Read public materials on long COVID research priorities and symptom clusters.
- NCBI Bookshelf: Find free biomedical textbook chapters on pharmacology, bioinformatics, and disease mechanisms.
- MIT OpenCourseWare: Use free course materials on computational biology, data analysis, and machine learning basics.
Translational Medical Science Category Guide
How to Do Real Translational Medical Science Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases
