Modeling Vaccine Hesitancy on Social Media
ISEF Category: Computational Biology and Bioinformatics
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Computational Epidemiology · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
A rumor can spread faster than a virus. One post can jump from a tiny online group to thousands of feeds in hours, then show up later as missed vaccines in a real county. That makes social media a data source, not just a conversation. Your job is to model how the signal moves.
What Is It?
This project studies how vaccine-hesitancy posts spread online and whether those waves line up with lower vaccination rates in real communities. Think of each post as a pebble dropped into water. The ripples are the reposts, replies, and quotes that follow. A Bayesian hierarchical model is a statistical model that estimates parameters at several nested levels at once, such as posts within counties, so you can look at both individual posts and county-level patterns in the same analysis.
You are not trying to prove that one tweet caused one missed shot. That would be too simple, and social data rarely works that way. You are testing whether stronger hesitancy cascades tend to appear in places with bigger uptake gaps, after you account for location, time, and other factors. The Bayesian part helps you handle uncertainty instead of pretending every measurement is exact.
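One way to write such a model down, as a sketch only: suppose county c sits in state s[c], has an uptake gap y_c, and has a standardized cascade-intensity score x_c. The symbols, the grouping by state, and the priors below are illustrative assumptions, not a specification from this guide.

```latex
\begin{aligned}
y_c &\sim \mathcal{N}(\mu_c,\ \sigma^2) \\
\mu_c &= \alpha_{s[c]} + \beta\, x_c \\
\alpha_s &\sim \mathcal{N}(\mu_\alpha,\ \tau^2) \\
\mu_\alpha,\ \beta &\sim \mathcal{N}(0,\ 1), \qquad \sigma,\ \tau \sim \text{Half-Normal}(1)
\end{aligned}
```

Here β is the link between cascade intensity and the uptake gap, and its posterior interval is how the model reports uncertainty. The parameter τ controls how strongly the state-level intercepts are pulled toward a shared mean.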
Why This Is a Good Topic
This is a strong science fair topic because the question is measurable, but not easy. You can track post volume, sentiment, repost networks, and county vaccination data, then ask whether they move together. That gives you real stats work, real data cleaning, and a public health angle. You can also build a project that feels original because you choose the platform mix, the time window, and the way you define a diffusion cascade.
Research Questions
- How does the size of a vaccine-hesitancy cascade on social media relate to county-level vaccine uptake gaps?
- What is the effect of sentiment strength on the speed of repost spread within hesitancy clusters?
- Does including Bluesky data improve the model fit compared with using Twitter/X archives alone?
- To what extent do county demographics change the link between online hesitancy and uptake gaps?
- Which cascade features, such as depth, breadth, or persistence, best predict later uptake differences?
- How does the timing of hesitancy posts before local vaccination campaigns affect the observed county gap?
- What is the effect of different sentiment classifiers on the estimated strength of the diffusion model?
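For the last question, one possible first check is how often two sentiment labelers agree beyond chance. The sketch below computes Cohen's kappa with the standard library only; the function name, the label strings, and the tiny example lists are hypothetical and stand in for your own labeled posts.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same posts."""
    n = len(labels_a)
    # Share of posts where the two labelers give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    # Note: undefined (division by zero) if both labelers use one label only.
    return (observed - expected) / (1 - expected)

# Made-up labels for four posts, for illustration only.
classifier_1 = ["hesitant", "hesitant", "neutral", "neutral"]
classifier_2 = ["hesitant", "neutral", "neutral", "neutral"]
kappa = cohens_kappa(classifier_1, classifier_2)
print(round(kappa, 3))
```

A low kappa suggests the two classifiers are measuring different things, so any downstream model estimate may depend heavily on which one you pick.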
Basic Materials
- Laptop with at least 16 GB RAM and reliable internet access.
- Python installed with Jupyter Notebook.
- Access to public Twitter/X archives or an approved social data archive.
- Access to Bluesky public posts through a data dump or firehose snapshot.
- County-level vaccination data from CDC or state health departments.
- Census county demographics data from the U.S. Census Bureau.
- Spreadsheet software for early data review and cleaning.
- Git version control for tracking code changes.
Advanced Materials
- University or institutional server access for larger text datasets.
- Python environment with PyMC or Stan for Bayesian modeling.
- Natural language processing libraries for sentiment and topic tagging.
- Network analysis tools for cascade mapping.
- Secure storage for large scraped or archived text corpora.
- County shapefiles for spatial comparison.
- NIH, CDC, or state public health datasets for validation.
- Optional GPU access for faster text processing.
Software & Tools
- Python: Cleans text, merges datasets, and runs the Bayesian model.
- Jupyter Notebook: Lets you document code, plots, and notes in one place.
- Pandas: Organizes post-level and county-level data into analysis tables.
- PyMC: Fits Bayesian hierarchical models and estimates uncertainty.
- Gephi: Visualizes repost networks and cascade structure.
Experiment Steps
- Define the exact outcome you want to explain, such as county uptake gaps or post cascade growth.
- Choose one social platform strategy, then decide how you will match posts to counties and dates.
- Build a clean dataset that separates post features, network features, and county features.
- Plan a baseline model first, then add hierarchical layers so you can test whether the extra structure helps.
- Decide how you will score sentiment, cascade size, and model fit before you run the analysis.
- Set up validation checks that compare your model against simpler alternatives and known county patterns.
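Before scoring cascades, it helps to pin down what size, depth, and breadth mean in code. The following is a minimal sketch that assumes each repost is stored as a (parent_id, child_id) pair; the function name, the ID format, and the example edges are hypothetical, and real platform exports will need their own parsing.

```python
from collections import defaultdict, deque

def cascade_features(root, edges):
    """Return size, depth, and max breadth of one repost cascade.

    root  -- ID of the original post (hypothetical ID format)
    edges -- list of (parent_id, child_id) pairs, one per repost
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    # Breadth-first walk from the root, counting posts at each level.
    level_counts = defaultdict(int)
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        level_counts[level] += 1
        for child in children[node]:
            if child not in seen:  # guard against duplicate or cyclic records
                seen.add(child)
                queue.append((child, level + 1))

    return {
        "size": len(seen),                      # all posts, including the root
        "depth": max(level_counts),             # longest chain of reposts
        "breadth": max(level_counts.values()),  # most posts at any one level
    }

# Made-up cascade: p0 is reposted twice, and one branch continues two more hops.
example_edges = [("p0", "p1"), ("p0", "p2"), ("p1", "p3"), ("p3", "p4")]
features = cascade_features("p0", example_edges)
print(features)
```

Fixing these definitions before you run the analysis keeps you from tuning them after seeing the results.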
Common Pitfalls
- Mixing Twitter/X archives and Bluesky posts without harmonizing time windows, which makes diffusion rates look different for the wrong reason.
- Using a weak sentiment labeler, which can misread sarcasm, quotes, and neutral news posts as hesitancy.
- Treating repost counts as independent, which ignores network clustering and inflates confidence.
- Matching posts to counties too loosely, which creates fake geographic links.
- Fitting a complex Bayesian model before checking a simpler baseline, which hides whether the extra structure actually improves prediction.
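To guard against the last pitfall, you could score two very simple baselines on held-out counties before fitting anything hierarchical. The sketch below uses leave-one-out error to compare a mean-only prediction with a one-predictor straight line. The variable names and numbers are invented for illustration and are not real data.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def loo_errors(xs, ys):
    """Leave-one-out mean squared error for a mean-only and a linear baseline."""
    mean_sq, line_sq = 0.0, 0.0
    for i in range(len(xs)):
        # Hold out county i and fit both baselines on the rest.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        mean_sq += (ys[i] - mean(train_y)) ** 2
        line_sq += (ys[i] - (intercept + slope * xs[i])) ** 2
    n = len(xs)
    return mean_sq / n, line_sq / n

# Made-up illustrative numbers, one entry per county.
log_cascade_size = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
uptake_gap = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
mse_mean, mse_line = loo_errors(log_cascade_size, uptake_gap)
print(f"mean-only: {mse_mean:.3f}, linear: {mse_line:.3f}")
```

If a hierarchical model cannot beat the linear baseline on held-out counties, the extra structure is probably not earning its complexity.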
What Makes This Competitive
A stronger project will do more than count posts. You can test several model structures, compare them with established Bayesian model-comparison criteria such as leave-one-out cross-validation (LOO) or WAIC, and show which features of a cascade matter most. You can also compare two platforms, two sentiment methods, or two ways of linking online data to counties. That kind of careful design turns a noisy social problem into a clear analytical story.
Project Variations
- Use Reddit health discussions instead of Bluesky to see whether another platform shows the same diffusion pattern.
- Focus on flu or COVID vaccine hesitancy, then compare whether the cascade structure changes by topic.
- Replace county uptake gaps with ZIP code or state-level data to test whether the geographic signal changes with scale.
Learn More
- CDC Vaccination Data: Search the CDC data portal for state and county vaccine coverage tables.
- U.S. Census Bureau: Find county demographics and population estimates in the American Community Survey tables.
- NIH PubMed: Search for review articles on vaccine hesitancy, social media diffusion, and Bayesian modeling.
- MIT OpenCourseWare: Look for course materials on probability, statistics, and Bayesian inference.
- PyMC Documentation: Read the official docs for building Bayesian hierarchical models in Python.
- Gephi: Use the free software site and tutorials to learn network visualization for cascades.
Computational Biology and Bioinformatics Category Guide
How to Do Real Computational Biology Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →
