Early Disease Signal Detection With NLP

ISEF Category: Computational Biology and Bioinformatics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Epidemiology · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

Outbreaks rarely start with a lab result. They often start with small clues, like a weird cluster of posts, a local news blip, or a sudden burst of concern in a forum. Your job is to teach a computer to spot those clues before official counts catch up. That is a real data science problem with real stakes.

What Is It?

This project asks you to build a text analysis pipeline that scans public sources for early hints of disease activity. Think of it like a smoke detector for language. Instead of looking for fire, you look for patterns such as sudden mentions of symptoms, animal die-offs, clinic crowding, or worried local reporting.

The hard part is that most of those mentions are noisy. People joke, speculate, repost, or talk about unrelated illness. So you need a method that separates signal from chatter. NLP, or natural language processing, helps computers read text, group similar posts, and score how likely a post points to a real event.

You then test your pipeline on past events, such as known mpox or avian flu spillover timelines. If your method flags patterns before confirmation dates, that gives you a measurable way to judge performance. You are not trying to diagnose anyone. You are testing whether public text can act as an early warning layer.
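
One way to turn "flags patterns before confirmation dates" into a number is a lead time in days. The sketch below is only an illustration: the function name `lead_time_days`, the 30-day lookback window, and every date are assumptions made up for the example, not real outbreak data.

```python
from datetime import date, timedelta

def lead_time_days(alert_dates, confirmation, lookback_days=30):
    """Days between the earliest alert in the lookback window and confirmation.

    Returns None when no alert falls inside the window. Alerts on or after
    the confirmation date are ignored because they give no early warning.
    """
    window_start = confirmation - timedelta(days=lookback_days)
    in_window = [d for d in alert_dates if window_start <= d < confirmation]
    if not in_window:
        return None
    return (confirmation - min(in_window)).days

# Hypothetical alert dates and confirmation date, for illustration only.
alerts = [date(2001, 1, 5), date(2001, 2, 20), date(2001, 2, 25), date(2001, 3, 3)]
confirmed = date(2001, 3, 1)

lead = lead_time_days(alerts, confirmed)
print(lead)  # 9: the earliest in-window alert (Feb 20) precedes Mar 1 by 9 days
```

The lookback window matters: without it, any stray alert months earlier would count as a detection, so the window length is itself an assumption worth reporting.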

Why This Is a Good Topic

This is a strong science fair topic because you can define clear inputs, outputs, and benchmarks. You can compare your model against a timeline of confirmed outbreak events, which makes the project testable. The real-world link is obvious, since early detection can help public health teams respond faster. You can also learn text cleaning, classification, time series analysis, and evaluation metrics in one project.

Research Questions

How does adding Reddit posts change early outbreak signal detection compared with using news articles alone?
What is the effect of different keyword filters on false alarms in pre-syndromic surveillance?
Does a model that allows a time lag between text signals and confirmed cases detect confirmed spillover events earlier than a simple mention-count baseline?
To what extent do location tags improve the accuracy of outbreak signal timing?
Which text features, such as symptom words, animal terms, or urgency phrases, best predict a later confirmed event?
How does weekly aggregation compare with daily aggregation for detecting weak early signals?
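
The daily versus weekly question can be explored with a few lines of pandas. This is a minimal sketch under stated assumptions: the `posts` table, its column names, and its dates are hypothetical, and `"W"` uses the pandas default of weeks ending on Sunday.

```python
import pandas as pd

# Hypothetical posts table: one row per post, with its publication date.
posts = pd.DataFrame({
    "date": pd.to_datetime([
        "2001-03-01", "2001-03-01", "2001-03-02", "2001-03-05",
        "2001-03-09", "2001-03-09", "2001-03-10",
    ]),
    "source": ["news", "reddit", "reddit", "news", "reddit", "news", "reddit"],
})

indexed = posts.set_index("date")

# Count posts per calendar day; days with no posts get a count of 0.
daily = indexed.resample("D").size()

# Count posts per week; pandas labels each bin by the Sunday that ends it.
weekly = indexed.resample("W").size()

print(daily)
print(weekly)
```

Daily counts keep short bursts visible but are noisier, while weekly counts are smoother but can delay an alert by several days, which is the trade-off the research question asks you to measure.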

Basic Materials

Laptop with internet access.
Python installed with Jupyter Notebook.
Free text data sources from local news sites, Reddit, and ProMED archives.
Spreadsheet software for tracking event dates and model results.
A small curated list of confirmed outbreak dates from public health reports.
Basic reference articles on outbreak surveillance and NLP.
Headphones or a reader-mode browser extension for working through long sets of articles without distractions.

Advanced Materials

Laptop or desktop with a dedicated GPU, optional.
Python with spaCy, scikit-learn, pandas, NumPy, and matplotlib.
Sentence embedding tools such as SentenceTransformers.
Named entity recognition resources for locations, species, and symptoms.
Annotated validation set for labeling outbreak-related and unrelated posts.
Access to historical public health timelines and archived news datasets.
A database or search index such as SQLite or Elasticsearch for large text collections.

Software & Tools

Python: Handles text cleaning, feature extraction, modeling, and evaluation.
Jupyter Notebook: Lets you test ideas, inspect errors, and document your analysis.
spaCy: Extracts entities, tokenizes text, and supports custom NLP pipelines.
pandas: Organizes posts, dates, sources, and event labels into clean tables.
scikit-learn: Trains baseline classifiers and compares model performance.
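
As a rough illustration of how these tools fit together, here is a small scikit-learn baseline that scores posts as disease-related or not. The choice of TF-IDF features with logistic regression is an assumption for the sketch, not a method this guide prescribes, and the example posts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled posts (1 = disease-related, 0 = irrelevant).
train_texts = [
    "dead birds found near the pond",
    "many chickens dying on the farm",
    "clinic crowded with fever and rash cases",
    "several people report fever and rash after the fair",
    "sick of homework this week",
    "this game is going viral online",
    "my phone battery is dying again",
    "traffic near the school is terrible",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each post into word weights; logistic regression learns
# which words push a post toward the disease-related class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_posts = ["fever and rash cases at the clinic", "homework is going viral"]

# Column 1 of predict_proba is the probability of label 1 (disease-related).
scores = model.predict_proba(new_posts)[:, 1]
for text, score in zip(new_posts, scores):
    print(f"{score:.2f}  {text}")
```

Eight training posts are far too few for a real model. In practice you would train on your annotated validation set and hold out part of it for testing.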

Experiment Steps

Define the outbreak events you will benchmark and the exact dates you will treat as ground truth.
Select your text sources and decide how you will label a post as disease-related, ambiguous, or irrelevant.
Build a baseline that counts mentions over time before adding NLP features.
Design a comparison between source mixes, such as news only versus news plus Reddit plus ProMED.
Plan an evaluation method that measures early warning value, false alarms, and delay relative to confirmation.
Check whether your model still works after you change one assumption, such as the time window or keyword list.
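
The mention-count baseline and the "change one assumption" check in the steps above can be sketched with the standard library alone. The rolling z-score rule, the name `flag_spikes`, the threshold of 3, the `min_sigma` floor, and the counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def flag_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices where a count exceeds its trailing-window z-score threshold.

    min_sigma floors the standard deviation so that a very flat history
    does not turn a tiny change into a huge z-score.
    """
    alerts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]  # the `window` values before index i
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        z = (counts[i] - mu) / sigma
        if z >= threshold:
            alerts.append(i)
    return alerts

# Hypothetical daily mention counts with one jump at index 8.
daily_counts = [2, 3, 2, 4, 3, 2, 3, 3, 12, 4]

print(flag_spikes(daily_counts, window=7))  # [8]
print(flag_spikes(daily_counts, window=5))  # [8]: same alert with a shorter window
```

Rerunning with a different `window` or `threshold` is one simple robustness check: if the alerts move or vanish, the result depends heavily on that assumption.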

Common Pitfalls

Treating every mention of a disease as a true signal, which inflates false positives.
Mixing confirmation dates with reporting dates, which shifts the benchmark and can make the model look earlier than it really is.
Pulling too much data from one source, which lets that source's bias dominate the result.
Using vague labels like relevant or not relevant without a clear annotation rule, which weakens model training.
Ignoring local context words, which causes the system to confuse rumors, news summaries, and firsthand reports.
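
One way to check whether an annotation rule is clear enough is to have two people label the same posts and measure their agreement. The sketch below computes Cohen's kappa by hand. The technique is a suggestion rather than something this guide requires, and the annotators and their labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement.

    1.0 means perfect agreement; 0.0 means no better than chance.
    Undefined (division by zero) if both annotators use a single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same eight posts.
annotator_1 = ["related", "related", "irrelevant", "ambiguous",
               "irrelevant", "related", "irrelevant", "irrelevant"]
annotator_2 = ["related", "ambiguous", "irrelevant", "ambiguous",
               "irrelevant", "related", "related", "irrelevant"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.61
```

A low kappa suggests the labeling rule needs tightening before you train on the labels.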

What Makes This Competitive

A stronger project goes beyond keyword counting. You can compare multiple source combinations, test several early warning definitions, and report confidence intervals on detection delay. You can also use a careful labeling scheme and show where the model fails, not just where it works. That kind of analysis makes the work feel like real surveillance research, not just a text mining demo.
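
A confidence interval on detection delay can be estimated with a percentile bootstrap. This is a minimal sketch, assuming you already have one lead time per benchmark event. The function name `bootstrap_ci`, the resample count, and the lead times themselves are invented for the example.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        mean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical lead times in days (negative = alert came after confirmation).
lead_times = [6, 3, -1, 9, 4, 0, 7, 2]

low, high = bootstrap_ci(lead_times)
print(f"mean lead time: {mean(lead_times):.2f} days")
print(f"95% bootstrap CI: {low:.2f} to {high:.2f} days")
```

With only a handful of benchmark events the interval will be wide, and reporting that width honestly is part of what makes the analysis credible.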

Project Variations

Use only one source type, such as Reddit or local news, to test which channel gives the earliest warning.
Switch from disease names to symptom and animal-interaction language to see whether indirect signals appear sooner.
Compare a simple keyword model with an NLP classifier that uses embeddings or named entities.
Build the same pipeline for a different kind of health event, such as foodborne illness outbreaks or health complaints linked to wildfire smoke.

Learn More

CDC outbreak and surveillance resources: Read public health background on outbreak detection and surveillance methods on the CDC website.
ProMED-mail archive: Search archived reports to study how unusual disease events are described over time.
PubMed: Search for review articles on digital disease surveillance, syndromic surveillance, and outbreak prediction.
NIH PubMed Central: Find free full-text papers on NLP for public health and epidemic monitoring.
MIT OpenCourseWare, Introduction to Machine Learning: Use the free course materials to review classification, evaluation, and model validation.
NOAA National Weather Service archives: Study how alert systems are structured, then compare their logic with disease surveillance alerts.
