Symptom-Triage Chatbots for Science Fair Research Ideas

ISEF Category: Biomedical and Health Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: Home Setup · Time: 1 to 2 Months

The Hook

A chatbot can sound calm and still give bad advice. In health care that matters, because a missed emergency costs time a patient does not have. Your project asks a simple question: can an open-source model help sort symptoms without making up facts? That turns a flashy AI demo into a real safety test.

What Is It?

Symptom triage means sorting a problem by urgency, not naming a disease. Think of it like a digital front desk at a clinic. The bot should decide whether a symptom needs emergency care, prompt medical attention, or simple self-care.

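In this project, you start with Llama-3, an openly available (open-weight) large language model. You fine-tune it on MedQA, a set of multiple-choice questions drawn from medical licensing exams, and HealthSearchQA, a set of commonly searched consumer health questions. HealthSearchQA was released as questions without reference answers, so plan to pair it with vetted answers before you use it as training data. You then test the model with vignettes, short case descriptions that mimic real patient stories. You look for hallucinations, which are answers that sound confident but add facts that are not in the case or are plain wrong.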
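One rough way to catch the "adds facts that are not in the case" kind of hallucination is a keyword check. The sketch below is only a first-pass filter to decide which answers a human should read first. The term list and the function name `unsupported_terms` are made up for illustration, and plain substring matching will misfire on negations such as "no fever".

```python
# Hypothetical vocabulary; a real project would use a much longer, reviewed list.
SYMPTOM_TERMS = {"chest pain", "fever", "stomach pain", "shortness of breath", "rash"}

def unsupported_terms(vignette, answer, vocabulary=SYMPTOM_TERMS):
    """Return symptom terms that appear in the answer but not in the vignette.

    This is plain substring matching, so it cannot handle negation or
    synonyms. Treat every hit as a flag for manual review, not a verdict.
    """
    vignette_text = vignette.lower()
    answer_text = answer.lower()
    return sorted(
        term for term in vocabulary
        if term in answer_text and term not in vignette_text
    )

# Made-up example case and model reply.
case = "A 40-year-old reports stomach pain after meals."
reply = "Given your stomach pain and fever, you should seek prompt care."
flags = unsupported_terms(case, reply)
print(flags)
```

Here the reply mentions a fever the case never described, so that term is flagged for review.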

Why This Is a Good Topic

This is a strong science fair topic because you can measure it in a clear way. You can score safety, urgency, and hallucination rate, then compare different prompts or training sets. The project connects to a real problem: people already use chatbots for health questions, and bad advice can cause harm. You can learn model evaluation, error analysis, and how to turn messy text into numbers.
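To see how urgency advice can become numbers, here is a minimal sketch. It assumes the three urgency levels can be ordered from least to most urgent, so that "under-triage" (the bot says less urgent than the label) can be counted separately from "over-triage". The names `Urgency` and `triage_rates` are illustrative, not part of any library.

```python
from enum import IntEnum

class Urgency(IntEnum):
    """Ordered urgency levels; a higher value means more urgent."""
    SELF_CARE = 0
    PROMPT = 1
    EMERGENCY = 2

def triage_rates(gold, predicted):
    """Return accuracy, under-triage rate, and over-triage rate."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    n = len(gold)
    correct = sum(p == g for g, p in zip(gold, predicted))
    under = sum(p < g for g, p in zip(gold, predicted))  # the unsafe direction
    over = sum(p > g for g, p in zip(gold, predicted))   # cautious but wasteful
    return {"accuracy": correct / n, "under_triage": under / n, "over_triage": over / n}

# Made-up labels for four vignettes.
gold = [Urgency.EMERGENCY, Urgency.EMERGENCY, Urgency.PROMPT, Urgency.SELF_CARE]
pred = [Urgency.EMERGENCY, Urgency.PROMPT, Urgency.PROMPT, Urgency.PROMPT]
rates = triage_rates(gold, pred)
print(rates)
```

Reporting under-triage on its own matters because two models with the same accuracy can differ a lot in how often they miss an emergency.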

Research Questions

How does fine-tuning on MedQA versus HealthSearchQA change the rate of unsafe triage advice?
What is the effect of adding a refusal-first system prompt on hallucinated diagnoses?
Does retrieval from trusted medical summaries improve referral accuracy on borderline symptoms?
To what extent do model answers match medical-student vignette labels for emergency, prompt, and routine cases?
Which symptom types, such as chest pain, fever, or stomach pain, produce the highest hallucination rate?
How does prompt length affect the chatbot's consistency across repeated runs?
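For the consistency question, one simple measure is how often repeated runs on the same vignette agree with the most common answer. This is a sketch of that idea only; the function name `run_consistency` is made up here, and other agreement measures would work too.

```python
from collections import Counter

def run_consistency(labels):
    """Return (most common label, share of runs that gave that label)."""
    if not labels:
        raise ValueError("need at least one run")
    modal_label, modal_count = Counter(labels).most_common(1)[0]
    return modal_label, modal_count / len(labels)

# Made-up labels from five runs of the same vignette and prompt.
runs = ["emergency", "emergency", "prompt", "emergency", "emergency"]
label, agreement = run_consistency(runs)
print(label, agreement)
```

Averaging the agreement score over all vignettes gives one consistency number per prompt length.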

Basic Materials

Laptop or desktop computer with at least 16 GB RAM.
Python 3.11 environment.
Jupyter Notebook or VS Code.
Open-source Llama-3 model access or a small local model for testing.
CSV file of de-identified test vignettes with answer labels.
Spreadsheet or database for manual error tracking.
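The vignette CSV could look something like the sketch below. The column names (`vignette_id`, `text`, `gold_label`) and the three label strings are assumptions chosen for illustration, and the two sample rows are invented. The loader rejects unknown labels and duplicate IDs so that typos do not leak into your scores.

```python
import csv
import io

ALLOWED_LABELS = {"emergency", "prompt", "self_care"}       # assumed label names
REQUIRED_COLUMNS = {"vignette_id", "text", "gold_label"}    # assumed column names

def load_vignettes(handle):
    """Read vignettes from an open CSV file and validate labels and IDs."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows, seen_ids = [], set()
    for row in reader:
        label = row["gold_label"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label {label!r} in {row['vignette_id']}")
        if row["vignette_id"] in seen_ids:
            raise ValueError(f"duplicate id {row['vignette_id']}")
        seen_ids.add(row["vignette_id"])
        rows.append({"vignette_id": row["vignette_id"],
                     "text": row["text"].strip(),
                     "gold_label": label})
    return rows

# Invented sample rows; a real file would be opened with open(path, newline="").
SAMPLE_CSV = (
    "vignette_id,text,gold_label\n"
    'v001,"Crushing chest pain spreading to the left arm for 20 minutes.",emergency\n'
    'v002,"Mild runny nose for two days, no fever.",self_care\n'
)
vignettes = load_vignettes(io.StringIO(SAMPLE_CSV))
print(len(vignettes), vignettes[0]["gold_label"])
```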

Advanced Materials

CUDA-capable GPU workstation or server.
Hugging Face Transformers with the PEFT library for LoRA fine-tuning.
Curated benchmark set of de-identified clinical vignettes with clinician labels.
Annotation interface for expert review, such as REDCap or a structured spreadsheet.
Secure storage for sensitive text and model outputs.
Python or R statistical package for error analysis and significance testing.

Software & Tools

Python: Runs preprocessing, model inference, and scoring scripts.
Jupyter Notebook: Lets you inspect outputs and compare error patterns.
pandas: Organizes vignettes, labels, and model responses in tables.
Hugging Face Transformers: Loads Llama-3 and handles fine-tuning workflows.
scikit-learn: Calculates accuracy, recall, and confusion matrices.
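To show what a confusion matrix and recall actually count, here is a dependency-free sketch. In a real project, scikit-learn's `confusion_matrix` and `recall_score` compute the same quantities; the helper names below are made up for this example, and the labels are invented.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count each (gold label, predicted label) pair."""
    return Counter(zip(gold, predicted))

def recall_for(label, gold, predicted):
    """Share of cases with this gold label that the model labeled correctly."""
    counts = confusion_counts(gold, predicted)
    total = sum(c for (g, _), c in counts.items() if g == label)
    if total == 0:
        return None  # recall is undefined when the label never occurs
    return counts[(label, label)] / total

# Made-up gold labels and model outputs for five vignettes.
gold = ["emergency", "emergency", "emergency", "prompt", "self_care"]
pred = ["emergency", "prompt", "emergency", "prompt", "prompt"]

matrix = confusion_counts(gold, pred)
emergency_recall = recall_for("emergency", gold, pred)
print(matrix[("emergency", "prompt")], emergency_recall)
```

Emergency recall is the number to watch: it is the share of true emergencies the chatbot actually sent to emergency care.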

Experiment Steps

Define the triage labels and decide what counts as unsafe, vague, or correct.
Build a holdout set of vignettes that covers emergencies, routine cases, and borderline symptoms.
Choose a baseline model and one or two fine-tuned versions so you can compare changes.
Design a scoring rubric for hallucinations, wrong urgency, refusal quality, and missing red flags.
Run each vignette through every model version with the same prompt format.
Compare the error rates, then check which symptom types still break the system.
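The steps above can be sketched as a small evaluation loop. This is only an outline under stated assumptions: each model is stood in for by a plain Python function that takes a prompt and returns a label, and the prompt wording, model names, and vignettes are all invented. In a real run you would swap the stubs for calls to your baseline and fine-tuned models.

```python
# One shared template so every model version sees the same prompt format.
PROMPT_TEMPLATE = (
    "Classify the urgency of this case as emergency, prompt, or self_care.\n"
    "Case: {text}\nUrgency:"
)

def evaluate(models, vignettes):
    """Return the error rate of each model on the same set of vignettes."""
    results = {}
    for name, generate in models.items():
        wrong = 0
        for vignette in vignettes:
            prompt = PROMPT_TEMPLATE.format(text=vignette["text"])
            answer = generate(prompt).strip().lower()
            if answer != vignette["gold_label"]:
                wrong += 1
        results[name] = wrong / len(vignettes)
    return results

# Stub "models" for illustration only; they are not real language models.
def always_self_care(prompt):
    return "self_care"

def keyword_stub(prompt):
    return "emergency" if "chest pain" in prompt.lower() else "self_care"

vignettes = [
    {"text": "Crushing chest pain for 20 minutes.", "gold_label": "emergency"},
    {"text": "Mild runny nose for two days.", "gold_label": "self_care"},
]
error_rates = evaluate(
    {"always_self_care": always_self_care, "keyword_stub": keyword_stub},
    vignettes,
)
print(error_rates)
```

Keeping the template in one constant makes it harder to give one model version a different prompt by accident.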

Common Pitfalls

Training on medical Q&A without separating the test vignettes, which makes the chatbot look better because it has already seen similar cases.
Scoring an answer as safe just because it sounds cautious, even when it still misses emergency red flags.
Mixing plain symptom questions with diagnosis questions, which blurs whether the model can triage or only name diseases.
Using vignettes with one obvious answer, which hides how the chatbot behaves on borderline cases.
Letting answer length vary a lot, which can make longer replies seem better even when safety does not improve.
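The first pitfall, test vignettes leaking into the training data, can be screened for with a rough overlap check. The sketch below compares word sets with Jaccard similarity. The 0.8 threshold and the function names are assumptions for illustration, and a check like this misses paraphrased cases, so it supplements a careful manual split instead of replacing one.

```python
import re

def normalize(text):
    """Lowercase the text and keep only letters and digits as words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    words_a, words_b = set(normalize(a).split()), set(normalize(b).split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def flag_leaks(train_texts, test_texts, threshold=0.8):
    """Return (test index, train index, score) for suspiciously similar pairs."""
    flagged = []
    for i, test_text in enumerate(test_texts):
        for j, train_text in enumerate(train_texts):
            score = jaccard(test_text, train_text)
            if score >= threshold:
                flagged.append((i, j, round(score, 2)))
    return flagged

# Invented examples: the first test vignette is a near copy of a training item.
train = ["A 55-year-old man has crushing chest pain.", "What causes a mild headache?"]
test = ["A 55-year-old man has crushing chest pain!", "Child with fever and rash for three days."]
leaks = flag_leaks(train, test)
print(leaks)
```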

What Makes This Competitive

A strong version of this project does more than count right and wrong answers. It tests whether the chatbot gives safe urgency advice, refuses when it should, and stays consistent across different symptom patterns. Add a clean holdout set, human review of borderline cases, and a comparison between baseline, fine-tuned, and safety-guarded prompts. That turns it into a real evaluation study instead of a demo.
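When you compare a baseline and a fine-tuned model on the same vignettes, one common choice for paired right-or-wrong results is an exact McNemar test, which looks only at the vignettes where the two models disagree. The sketch below is one possible way to run that comparison, not the only valid test, and the counts in the example are invented.

```python
from math import comb

def mcnemar_exact(only_baseline_correct, only_finetuned_correct):
    """Two-sided exact McNemar p-value from the two disagreement counts.

    Under the null hypothesis, each disagreement is equally likely to favor
    either model, so the smaller count follows a Binomial(n, 0.5) distribution.
    """
    b, c = only_baseline_correct, only_finetuned_correct
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented counts: 2 vignettes only the baseline got right, 8 only the fine-tuned model got right.
p_value = mcnemar_exact(2, 8)
print(round(p_value, 4))
```

With these invented counts the p-value is about 0.11, so an 8-to-2 split on ten disagreements would not be strong evidence by itself; a larger holdout set gives the test more power.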

Project Variations

Test the chatbot on pediatric, adult, and older-adult vignettes to see whether age changes safety.
Compare direct symptom triage with a retrieval-augmented version that quotes trusted sources before answering.
Measure whether a refusal-first prompt lowers hallucinations more than a simple general safety prompt.

Learn More

PubMed: Search review articles on medical chatbots, triage safety, and hallucination reduction.
NIH MedlinePlus: Read plain-language symptom pages and warning signs that help you set safe referral labels.
CDC: Check guidance on emergency symptoms, urgent symptoms, and routine care.
MIT OpenCourseWare: Review machine learning and natural language processing basics before you fine-tune a model.
npj Digital Medicine: Search peer-reviewed papers on clinical AI, chatbot safety, and evaluation methods.

Biomedical and Health Sciences Category Guide

How to Do Real Biomedical and Health Sciences Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →