On-Device Health Note Summarizer

ISEF Category: Biomedical Engineering

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Your phone may know more about your day than you do. Steps, sleep, meals, and heart rate can become a messy pile of numbers fast. A smart note writer can turn that pile into a clean daily summary, but one wrong line can mislead a clinician. That makes this a strong project for testing accuracy, privacy, and trust.

What Is It?

This project asks you to build an AI agent that reads data from wearables, diet logs, and sleep records, then writes a short clinician-style note. Think of it like a very fast medical scribe. It does not diagnose disease. It turns raw signals into a plain-language summary, such as sleep quality, activity patterns, and possible anomalies.

The hard part is not writing text. The hard part is staying faithful to the source data. Large language models can sound confident even when they fill in missing details. In this project, you compare the note against a synthetic ground truth, which means a known answer set you create for testing. You can then measure hallucinations, which are statements the model invents or gets wrong, and compare an on-device model with a cloud model.
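One way to make "faithful to the source data" testable is to reduce each sentence in a note to a (field, value) claim and check it against the ground truth. The sketch below is a minimal illustration under stated assumptions: the field names, the example values, the 5% tolerance, and the three label names are all hypothetical choices, not part of any standard.

```python
from dataclasses import dataclass

# Hypothetical ground truth for one synthetic day (values invented for illustration).
GROUND_TRUTH = {"steps": 8200, "sleep_hours": 6.5, "resting_hr": 62}


@dataclass
class Claim:
    field: str    # which quantity the note sentence refers to
    value: float  # the number stated in the note


def label_claim(claim, truth, rel_tol=0.05):
    """Label one claim as 'supported', 'contradicted', or 'unsupported'."""
    if claim.field not in truth:
        # The note mentions something the input never contained.
        return "unsupported"
    expected = truth[claim.field]
    # Assumed rule: a value within 5% of the true value counts as correct.
    if abs(claim.value - expected) <= rel_tol * abs(expected):
        return "supported"
    return "contradicted"


claims = [
    Claim("steps", 8200),          # matches the input
    Claim("sleep_hours", 8.0),     # wrong value
    Claim("blood_pressure", 120),  # never present in the input
]
labels = [label_claim(c, GROUND_TRUTH) for c in claims]
print(labels)
```

Under this scheme, both "contradicted" and "unsupported" claims count as hallucinations, which matches the definition above: statements the model invents or gets wrong.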

Why This Is a Good Topic

This is a strong science fair topic because you can test it with clear numbers. You can measure hallucination rate, factual accuracy, note completeness, and speed. The project connects to real problems in digital health, like patient privacy and information overload. You can also learn how model choice, prompt design, and data formatting change performance.
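As a rough sketch of how those numbers could be computed for one note, assume each claim in the note has already been hand-labeled as "supported", "contradicted", or "unsupported" (hypothetical label names), and that you know how many required facts the note was expected to cover. The function name and the exact metric definitions below are assumptions you should adapt to your own rubric.

```python
from collections import Counter


def note_metrics(claim_labels, expected_fact_count, facts_covered):
    """Compute per-note rates from hand-assigned claim labels."""
    counts = Counter(claim_labels)
    total = len(claim_labels)
    # Assumed definition: a hallucination is any claim that is wrong or invented.
    hallucinated = counts["contradicted"] + counts["unsupported"]
    return {
        "hallucination_rate": hallucinated / total if total else 0.0,
        "factual_accuracy": counts["supported"] / total if total else 0.0,
        "completeness": (
            facts_covered / expected_fact_count if expected_fact_count else 0.0
        ),
    }


m = note_metrics(
    ["supported", "supported", "contradicted", "unsupported"],
    expected_fact_count=5,
    facts_covered=2,
)
print(m)
```

Keeping completeness separate from accuracy matters: a note that says almost nothing can score perfectly on accuracy while missing most of the required facts.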

Research Questions

How does an on-device LLM compare with a cloud LLM in hallucination rate when both summarize the same wearable and sleep data?
What is the effect of adding structured input fields, such as step count and sleep duration, on note accuracy?
Does including diet logs improve or reduce summary quality for daily clinician-style notes?
To what extent does prompt style change the number of unsupported claims in the generated note?
Which type of synthetic patient profile leads to more hallucinations: simple profiles or mixed-pattern profiles?
How does model size affect factual consistency, note length, and runtime on a local device?

Basic Materials

Laptop or desktop computer with enough memory to run a small local LLM and an API client for the cloud model.
Synthetic wearable, sleep, and diet datasets.
Spreadsheet software for organizing ground-truth labels.
Python installed with data analysis libraries.
CSV editor or text editor for prompt and output files.
Reference rubric for marking factual errors and hallucinations.

Advanced Materials

University or lab workstation with GPU access.
Locally hosted open-source LLM or edge model.
Wearable-style time-series datasets or simulated patient records.
Secure data storage plan for privacy testing.
Annotation interface for scoring factual consistency.
Statistical analysis software for agreement and significance testing.
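For the agreement testing mentioned above, one common choice is Cohen's kappa, which measures how often two annotators agree beyond what chance alone would produce. The sketch below is a plain-Python version for illustration; the label names and the example ratings are invented, and a statistics package would normally do this for you.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same list of claims."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of claims.")
    n = len(rater_a)
    # Fraction of claims where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two annotators score the same eight claims.
rater_a = ["ok", "ok", "halluc", "halluc", "ok", "halluc", "ok", "ok"]
rater_b = ["ok", "ok", "halluc", "ok", "ok", "halluc", "ok", "halluc"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))  # about 0.47
```

In this example the raters agree on 6 of 8 claims (75%), yet kappa is only about 0.47 because much of that agreement could happen by chance. A low kappa is usually a sign that the rubric needs tighter definitions.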

Software & Tools

Python: Cleans data, runs prompts, and scores output against ground truth.
Pandas: Organizes wearable, diet, and sleep records into tables for analysis.
scikit-learn: Provides metrics such as precision, recall, and Cohen's kappa for comparing models and analyzing errors.
ImageJ: Optional. This project has no image data, so skip it unless you add chart-based image analysis.
R: Runs statistical tests and creates publication-style plots for results.
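To show what "organizing records into tables" can look like in practice, here is a small sketch that joins wearable, sleep, and diet logs into one record per day. It uses only the Python standard library so it runs anywhere; pandas can do the same join with its merge function. The CSV contents, column names, and helper names are invented for illustration.

```python
import csv
import io

# Invented sample logs; a real project would read these from files.
WEARABLE_CSV = "date,steps,resting_hr\n2024-01-01,8200,62\n2024-01-02,4100,66\n"
SLEEP_CSV = "date,sleep_hours\n2024-01-01,6.5\n2024-01-02,7.8\n"
DIET_CSV = "date,calories\n2024-01-01,2100\n"  # no diet entry for the second day


def load_by_date(text):
    """Parse CSV text into a dict keyed by the 'date' column."""
    return {row["date"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_daily(*sources):
    """Join several logs on date; missing values stay as None."""
    tables = [load_by_date(s) for s in sources]
    fields = []
    for table in tables:
        for row in table.values():
            for key in row:
                if key != "date" and key not in fields:
                    fields.append(key)
    merged = []
    for day in sorted({d for table in tables for d in table}):
        record = {"date": day, **{f: None for f in fields}}
        for table in tables:
            for key, value in table.get(day, {}).items():
                if key != "date":
                    record[key] = float(value)
        merged.append(record)
    return merged


daily = merge_daily(WEARABLE_CSV, SLEEP_CSV, DIET_CSV)
for row in daily:
    print(row)
```

Keeping missing values as an explicit None is a deliberate choice here: it lets you check later whether the model invented a value for a field that was never recorded.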

Experiment Steps

Define the exact note format you want the model to produce, then decide which facts must come from the input data.
Build a synthetic ground-truth dataset with known daily patterns, then label the claims that should appear in each note.
Choose one input structure and one baseline prompt so you can compare models under the same conditions.
Create a scoring rubric that separates correct statements, missing facts, and hallucinated claims.
Run both the on-device and cloud versions on the same test set, then record accuracy, hallucination rate, and response time.
Analyze where the model fails most often, then test one change at a time, such as prompt format or input ordering.
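The steps above can be sketched as a tiny benchmark loop: generate synthetic days with a planted pattern, record the fact each note must state, run a model on every day, and log correctness and response time. Everything here is a placeholder under stated assumptions: the pattern names, the value ranges, the 6-hour short-sleep threshold, and especially stub_model, which stands in for a real on-device or cloud call.

```python
import random
import time


def make_synthetic_day(rng, pattern):
    """Create one day of inputs plus the fact a faithful note must state."""
    if pattern == "short_sleep":
        sleep_hours = round(rng.uniform(3.5, 5.5), 1)
    else:  # "typical"
        sleep_hours = round(rng.uniform(7.0, 8.5), 1)
    inputs = {"sleep_hours": sleep_hours, "steps": rng.randint(3000, 12000)}
    required = {"short_sleep_flag": pattern == "short_sleep"}
    return inputs, required


def stub_model(inputs):
    """Hypothetical stand-in for a real LLM call plus claim extraction."""
    # Assumed threshold: under 6 hours counts as short sleep.
    return {"short_sleep_flag": inputs["sleep_hours"] < 6.0}


def run_benchmark(model, days):
    """Run the model on every day and record correctness and latency."""
    rows = []
    for inputs, required in days:
        start = time.perf_counter()
        output = model(inputs)
        elapsed = time.perf_counter() - start
        rows.append({"correct": output == required, "seconds": elapsed})
    return rows


rng = random.Random(42)  # fixed seed so the test set is reproducible
days = [make_synthetic_day(rng, p) for p in ["typical", "short_sleep"] * 5]
results = run_benchmark(stub_model, days)
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"accuracy={accuracy:.2f} over {len(results)} days")
```

To compare an on-device model with a cloud model, you would pass two different model functions to run_benchmark with the same days list, so both see identical inputs.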

Common Pitfalls

Letting the model write vague summaries, which makes it hard to score factual accuracy.
Using synthetic data without enough variation, which hides failure cases and inflates performance.
Mixing up supported facts and inferred claims, which makes hallucination scoring inconsistent.
Comparing models with different prompts or different input fields, which makes the benchmark unfair.
Ignoring privacy tradeoffs and only measuring accuracy, which weakens the engineering story.

What Makes This Competitive

A strong version of this project does more than compare two chatbots. You can build a careful benchmark, define a strict hallucination rubric, and test several input designs. You can also separate factual errors from harmless style changes, which makes your analysis sharper. If you add privacy, speed, and accuracy into one evaluation framework, the project starts to look much more like real biomedical engineering work.

Project Variations

Compare summaries from a phone-sized local model against a laptop-hosted local model to test how device limits affect accuracy.
Swap wearable-only input for wearable plus diet input to see whether extra context improves note quality.
Test whether structured tables, raw text logs, or mixed-format prompts produce the lowest hallucination rate.
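For the input-format variation, the key is that all three prompts carry exactly the same facts and the same instruction, so only the layout changes. The sketch below shows one possible way to do that; the record values, the instruction wording, and the three formatter functions are illustrative assumptions.

```python
import json

# Invented example record for one day.
record = {"date": "2024-01-01", "steps": 8200, "sleep_hours": 6.5}

INSTRUCTION = (
    "Summarize only the facts given below. "
    "Do not add values that are not listed."
)


def as_table(rec):
    """Structured layout: one header row and one value row."""
    header = " | ".join(rec.keys())
    values = " | ".join(str(v) for v in rec.values())
    return f"{header}\n{values}"


def as_raw_log(rec):
    """Free-text layout, like a plain log entry."""
    return (
        f"On {rec['date']} the wearable logged {rec['steps']} steps "
        f"and {rec['sleep_hours']} hours of sleep."
    )


def as_mixed(rec):
    """Free text followed by the same fields as JSON."""
    return f"{as_raw_log(rec)}\nStructured fields: {json.dumps(rec)}"


formatters = [("table", as_table), ("raw", as_raw_log), ("mixed", as_mixed)]
prompts = {name: f"{INSTRUCTION}\n\n{fmt(record)}" for name, fmt in formatters}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Building every prompt from the same record and the same instruction string keeps the comparison fair: any difference in hallucination rate can then be attributed to the format.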

Learn More

PubMed: Search for review articles on clinical natural language processing, hallucination in medical AI, and digital health summarization.
NIH Open Access: Find free full-text papers on health informatics, wearable data analysis, and patient-facing AI tools.
Nature Scientific Reports: Search for open-access papers on local language models, evaluation metrics, and medical text generation.
MIT OpenCourseWare: Look for courses on machine learning, data science, and natural language processing.
arXiv: Search for preprints on LLM evaluation, factual consistency, and on-device inference.

Biomedical Engineering Category Guide

How to Do Real Biomedical Engineering Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

