Triage Chatbots With Calibrated Safety Checks


ISEF Category: Biomedical Engineering

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A chatbot that sounds confident can still be wrong in a dangerous way. In triage, that matters because a missed red flag can change what happens next. Your project asks a sharp question: can an AI give useful help without acting more sure than it should?

What Is It?

This project studies a medical triage chatbot, which is a system that helps sort symptoms into levels of urgency. Think of it as a careful front desk assistant in an emergency room, not a doctor. It reads a question, looks up relevant medical text, and gives a response with an uncertainty score so you can tell how confident it is.

The retrieval-augmented part means the model does not rely only on what it memorized. It also searches a trusted document set, then uses those sources to shape its answer. That can reduce blind guessing, but it can also create new failure modes if the wrong text gets pulled in or the model sounds safer than it really is.
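
The retrieve-then-answer flow can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production design: the three-entry `DOCS` corpus, the function names `retrieve` and `build_prompt`, and the prompt wording are all hypothetical, and word-count cosine similarity stands in for the embedding index a real project would likely use.

```python
import math
import re
from collections import Counter

# Hypothetical mini corpus standing in for a trusted document set.
DOCS = {
    "chest_pain": "Chest pain with shortness of breath or sweating needs emergency evaluation.",
    "headache": "Sudden severe headache with neck stiffness is a red flag for urgent care.",
    "sprain": "Mild ankle sprain can be managed with rest, ice, and elevation at home.",
}

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs=DOCS, k=1):
    """Return the k most similar documents as (doc_id, score) pairs."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in docs.items()]
    scored.sort(reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:k]]

def build_prompt(query, docs=DOCS, k=1):
    """Place the retrieved evidence in front of the question (wording is illustrative)."""
    hits = retrieve(query, docs, k)
    evidence = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer with an urgency level and a confidence."

print(build_prompt("I have chest pain and shortness of breath"))
```

Logging the retrieved document ID next to each answer makes it possible to trace a wrong triage call back to a bad retrieval.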

The fairness part checks whether the model performs differently across demographic groups, such as age, sex, or race. That matters because a triage tool should not flag one group more easily than another just because of bias in the training data. In this kind of project, you are not building a doctor. You are testing whether an AI safety layer can stay honest, accurate, and fair.

Why This Is a Good Topic

This is a strong science fair topic because you can test it with clear metrics. You can measure sensitivity to red-flag symptoms, calibration of uncertainty, and differences in performance across groups. The topic connects to a real problem, safer first-pass medical support, and it gives you room to study bias, prediction quality, and error tradeoffs. You can learn how modern medical AI systems are judged, not just how they answer questions.
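
Two of these metrics are simple enough to compute by hand. The sketch below assumes binary labels where 1 marks a red-flag case, and a confidence between 0 and 1 attached to each prediction. The toy lists are made up for illustration, and equal-width binning with five bins is only one common way to estimate expected calibration error.

```python
def sensitivity(y_true, y_pred):
    """Fraction of true red-flag cases (label 1) that the model also flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece

# Made-up example data: 1 = red flag present / flagged / answer correct.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
confidences = [0.95, 0.90, 0.85, 0.60, 0.55, 0.92]
correct = [int(t == p) for t, p in zip(y_true, y_pred)]

print("sensitivity:", sensitivity(y_true, y_pred))
print("ECE:", expected_calibration_error(confidences, correct))
```

A lower calibration error means the stated confidence is closer to how often the model is actually right.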

Research Questions

  • How does adding retrieval from medical documents change red-flag symptom detection accuracy?
  • What is the effect of calibrated uncertainty on the model's ability to avoid overconfident wrong answers?
  • Does the chatbot perform differently across demographic subgroups in a held-out triage set?
  • To what extent does fine-tuning on discharge summaries improve triage classification compared with a base model?
  • Which red-flag symptom categories produce the highest false-negative rate?
  • How does the quality of retrieved evidence affect the final urgency score?
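
For the calibration question, one widely used post-hoc method is temperature scaling: divide the model's raw score (logit) by a single number T chosen on a validation set. The sketch below assumes a binary red-flag score and picks T by a simple grid search over negative log-likelihood. The function names, the grid, and the toy data are all hypothetical, and a real project might fit T with an optimizer instead.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nll(logits, labels, temperature):
    """Average negative log-likelihood after dividing each logit by the temperature."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Return the grid value of T with the lowest validation NLL."""
    grid = grid or [0.25 * i for i in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))

# Made-up validation set: the model is very sure (|logit| = 4) but right only 75% of the time.
val_logits = [4, 4, 4, 4, -4, -4, -4, -4]
val_labels = [1, 1, 1, 0, 0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("confidence before:", round(sigmoid(4), 3), "after:", round(sigmoid(4 / T), 3))
```

A fitted T above 1 softens overconfident scores. Because it rescales every score the same way, it changes confidence but not which cases are flagged at a 0.5 threshold.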

Basic Materials

  • Laptop or desktop computer with enough memory to run inference, or access to a cloud notebook.
  • Curated medical text dataset or benchmark set with triage-style questions.
  • Spreadsheet software for logging predictions and labels.
  • Python environment with Jupyter Notebook.
  • Access to a free large language model API or local open-source model.
  • Access to published evaluation labels for red-flag symptoms and demographics.
  • Data dictionary or codebook for the benchmark dataset.

Advanced Materials

  • GPU workstation or university compute cluster.
  • Secure access to de-identified MIMIC-IV or similar clinical dataset.
  • Vector database or embedding index for retrieval experiments.
  • Model fine-tuning framework such as PyTorch with Hugging Face Transformers.
  • Calibration and fairness analysis libraries in Python.
  • Annotation tool for expert review of borderline cases.
  • Statistical testing package for subgroup comparison and uncertainty analysis.

Software & Tools

  • Python: Runs the model pipeline, evaluation scripts, and fairness tests.
  • Jupyter Notebook: Lets you document experiments and compare outputs step by step.
  • Hugging Face Transformers: Supports open-source language models, fine-tuning, and text generation.
  • pandas: Organizes prediction tables, labels, and subgroup results.
  • scikit-learn: Calculates classification metrics, confusion matrices, and calibration-related summaries.

Experiment Steps

  1. Define the exact triage task you will test, such as red-flag detection, urgency ranking, or referral recommendation.
  2. Choose the benchmark labels and subgroup fields you will use so your results are measurable and subgroup comparisons are fair.
  3. Select a baseline model and a retrieval setup so you can compare plain generation against evidence-supported answers.
  4. Plan how you will score both accuracy and uncertainty, including which cases count as high-risk misses.
  5. Design subgroup audits that compare performance across demographic slices and check for uneven false-negative rates.
  6. Predefine a small set of error analyses so you can explain failure patterns, not just report one summary score.

Common Pitfalls

  • Treating a fluent answer as a safe answer, which hides overconfidence in wrong triage calls.
  • Mixing training and test cases from the same patient source, which can inflate performance through leakage.
  • Ignoring calibration and reporting only accuracy, which misses whether the model knows when it is unsure.
  • Comparing demographic groups with too few samples, which makes fairness results noisy and misleading.
  • Using retrieved evidence without checking source quality, which can pull in irrelevant or outdated clinical context.
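
The leakage pitfall can be avoided by splitting on patients rather than on individual rows. The sketch below assumes each record carries a `patient_id` field, which is a hypothetical name, and uses only the standard library. scikit-learn's `GroupShuffleSplit` offers a comparable group-aware split.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign every record from a given patient to exactly one side of the split."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Made-up data: 5 patients with 2 notes each.
records = [{"patient_id": f"p{i // 2}", "text": f"note {i}"} for i in range(10)]
train, test = split_by_patient(records)

train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
print("train patients:", sorted(train_patients))
print("test patients:", sorted(test_patients))
print("overlap:", train_patients & test_patients)
```

Checking that the overlap is empty before any training run is a cheap safeguard against inflated scores.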

What Makes This Competitive

A stronger project goes past simple accuracy. You would compare a base model, a retrieval model, and a calibrated version, then show how each one changes safety-related errors. You would also test whether uncertainty scores track actual failure cases and whether subgroup gaps stay small under tougher evaluation. Clear plots, careful statistics, and a thoughtful error analysis can make the work look much more like real biomedical engineering research.
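
One way to test whether uncertainty scores track actual failures is to ask how often a wrong answer carries a higher uncertainty than a correct one. The sketch below computes that as an AUROC using pairwise comparison. The function name and the toy values are hypothetical, and scikit-learn's `roc_auc_score` would be a natural choice on a larger dataset.

```python
def uncertainty_auroc(uncertainties, is_error):
    """Probability that a random wrong answer has higher uncertainty than a
    random correct answer, with ties counted as half."""
    wrong = [u for u, e in zip(uncertainties, is_error) if e]
    right = [u for u, e in zip(uncertainties, is_error) if not e]
    if not wrong or not right:
        return float("nan")  # undefined without both outcomes
    wins = 0.0
    for uw in wrong:
        for ur in right:
            if uw > ur:
                wins += 1.0
            elif uw == ur:
                wins += 0.5
    return wins / (len(wrong) * len(right))

# Made-up scores: higher value = model reports more uncertainty.
uncertainties = [0.80, 0.70, 0.30, 0.20, 0.60, 0.10]
is_error = [1, 1, 0, 0, 0, 0]

print("AUROC of uncertainty vs. error:", uncertainty_auroc(uncertainties, is_error))
```

A value near 1.0 means high uncertainty reliably marks the failures, while a value near 0.5 means the score carries little information about when the model is wrong.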

Project Variations

  • Test the same triage pipeline on pediatric cases instead of adult cases to see whether uncertainty patterns change.
  • Swap the retrieval source from discharge summaries to guideline excerpts and compare whether evidence quality improves red-flag detection.
  • Focus only on fairness auditing and compare false-negative rates across demographic groups for the same triage model.

Learn More

  • MedQA on PubMed: Search for papers and benchmark descriptions on medical question answering and evaluation methods.
  • MIMIC-IV on PhysioNet: Find de-identified critical care and discharge summary data, plus documentation and access instructions.
  • NIH PubMed: Search review articles on clinical decision support, calibration, and fairness in medical AI.
  • npj Digital Medicine (Nature Portfolio): Read peer-reviewed studies on medical machine learning, triage tools, and safety evaluation.
  • MIT OpenCourseWare: Look for machine learning, statistics, and data science courses that help you build and test the model.


