Theater Captioning With Speaker Tags and Live AI Text

ISEF Category: Technology Enhances the Arts

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Human Information Exchange · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Most live captions only tell you what was said. They do not always tell you who said it. In a play, that gap can wreck the story flow. A caption system that tags each line to the right actor could make theater easier to follow for deaf and hard-of-hearing audiences.

What Is It?

This project studies a live captioning system for theater that tries to do two jobs at once. First, it turns speech into text with automatic speech recognition, which means software that converts audio into words. Second, it tries to match each line to the right speaker using diarization, which means splitting audio by who is talking, plus face recognition from a webcam.

Think of it like a subtitle stream with a name tag attached. Normal captions are like reading dialogue with the speaker labels removed. Your system asks whether adding speaker labels, and putting the text on a seat-back display, helps people follow the plot better, feel more immersed, and miss fewer lines.
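
One way to picture the "name tag" step is as a merge of two timelines: the speech recognizer produces timed text segments, and the diarizer produces timed speaker turns. The sketch below assumes both outputs have already been converted into simple start/end records; the `Segment` and `Turn` classes, the `tag_segments` helper, and the timings are hypothetical illustrations, not the output format of any specific library. Each caption gets the speaker whose turn overlaps it the most.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timed piece of recognized text (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A timed speaker turn from diarization (times in seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the time shared by two intervals (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def tag_segments(segments, turns, unknown="UNKNOWN"):
    """Label each text segment with the speaker whose turn overlaps it most."""
    tagged = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        tagged.append(f"{best_speaker}: {seg.text}")
    return tagged


# Made-up timings for illustration only.
segments = [
    Segment(0.0, 2.0, "Who's there?"),
    Segment(2.1, 4.0, "Nay, answer me."),
    Segment(9.0, 10.0, "Long live the king!"),
]
turns = [Turn(0.0, 2.05, "BERNARDO"), Turn(2.05, 4.2, "FRANCISCO")]

captions = tag_segments(segments, turns)
for line in captions:
    print(line)
```

The third line falls outside every speaker turn, so it is tagged UNKNOWN. Counting how often that happens, or how often the tag is wrong, is one of the things this project can measure.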

Why This Is a Good Topic

This makes a strong science fair topic because you can test both the technical side and the human side. You can compare caption styles, measure accuracy, and study how different display formats affect comprehension and comfort. The real-world problem is clear: live theater access for deaf and hard-of-hearing audiences. You can learn about speech recognition, human-computer interaction, and experimental design without needing to invent a brand-new AI model from scratch.

Research Questions

How does speaker-tagged captioning affect comprehension of dialogue in a short theater scene?
What is the effect of seat-back display placement on reading speed and scene recall?
Does adding face-based speaker attribution reduce confusion when multiple actors speak in quick turns?
To what extent does caption style change viewer-reported immersion during a performance clip?
Which caption format produces the fewest missed line attributions when actors overlap or interrupt each other?
How does line length in captions affect comprehension when captions appear in real time?
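
For the line-length question, it may help to generate the same speaker-tagged caption at several maximum widths so that width is the only thing that changes between conditions. The sketch below uses Python's standard `textwrap` module; the `format_caption` helper and the three widths are placeholder choices for illustration, not recommended values.

```python
import textwrap


def format_caption(speaker, text, max_chars):
    """Prefix the line with a speaker tag, then wrap it to max_chars per row."""
    tagged = f"{speaker}: {text}"
    return textwrap.wrap(tagged, width=max_chars)


line = "I have of late, but wherefore I know not, lost all my mirth"

# Placeholder widths; choose your own test conditions.
rows_by_width = {}
for width in (20, 32, 42):
    rows_by_width[width] = format_caption("HAMLET", line, width)
    print(f"--- max {width} characters per row ---")
    for row in rows_by_width[width]:
        print(row)
```

Shorter rows mean more rows per spoken line, so this variable is tied to how quickly the display has to refresh in real time.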

Basic Materials

Laptop with webcam and microphone, with enough processing power for speech recognition demos.
A sample theater script with clearly marked speakers.
Pre-recorded dialogue clips from a short scene.
Headphones for testing audio conditions.
Printed comprehension quiz sheets or online survey forms.
A simple projected or monitor-based subtitle display.
Consent forms for any human participants.
Spreadsheet software for recording scores and survey results.

Advanced Materials

University-grade microphone array for cleaner speaker separation.
Multiple webcams for improved face tracking and actor identification.
Semi-transparent OLED display or similar transparent display prototype.
Dedicated GPU workstation for running speech and vision models.
Audio interface for controlled playback and recording.
Screen capture software for timing and overlay validation.
Annotation tool for labeling speaker turns and caption errors.
Statistical software for mixed-effects or repeated-measures analysis.

Software & Tools

OpenAI Whisper: Converts theatrical audio into text for live or near-live captioning experiments.
pyannote.audio: Performs speaker diarization so you can separate who is talking.
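OpenCV: Detects faces and tracks actors in camera video. Detection only finds where faces are, so matching a face to a named actor requires an additional recognition step.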
ImageJ: Helps measure display visibility and contrast in captured test images.
Python: Lets you connect audio, video, caption output, and data analysis in one workflow.

Experiment Steps

Define the one experience you will measure first, such as comprehension, immersion, or speaker tracking accuracy.
Choose a comparison setup, such as speaker-tagged captions versus standard scrolling captions.
Build an evaluation plan that separates text accuracy, speaker attribution accuracy, and audience response.
Design controls for scene length, actor count, speech overlap, and display position.
Plan how you will score participant answers and convert performance into numbers you can compare.
Decide which failure cases matter most, such as missed speakers, delayed captions, or hard-to-read placement.
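
The evaluation plan above keeps text accuracy and speaker attribution accuracy as separate numbers. A minimal sketch of both scores follows, assuming you have a reference script and the system's output aligned line by line. Word error rate is computed here as word-level edit distance divided by the number of reference words; the function names are illustrative, and the text normalization (lowercasing only, punctuation kept) is a simplifying assumption you may want to change.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def attribution_accuracy(true_speakers, predicted_speakers):
    """Share of lines whose predicted speaker matches the script."""
    if len(true_speakers) != len(predicted_speakers):
        raise ValueError("speaker lists must be the same length")
    if not true_speakers:
        raise ValueError("need at least one line to score")
    correct = sum(t == p for t, p in zip(true_speakers, predicted_speakers))
    return correct / len(true_speakers)


# Toy inputs to show the arithmetic, not real results.
wer = word_error_rate("to be or not to be", "to be or to be")
acc = attribution_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])
print(f"WER: {wer:.3f}")
print(f"Speaker attribution accuracy: {acc:.2f}")
```

Reporting the two scores side by side shows when a system transcribes the words well but still credits lines to the wrong actor.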

Common Pitfalls

Measuring only word accuracy, which misses whether the system assigns each line to the right actor.
Testing with clean audio only, which hides how badly the model fails during overlap, laughter, or stage noise.
Ignoring caption delay, which can make a perfectly accurate line feel useless in real time.
Using one scene style only, which makes the results too narrow to generalize to other performances.
Skipping participant feedback on readability and immersion, which leaves out the audience experience the project is trying to improve.
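
Caption delay can be logged alongside accuracy. The sketch below assumes you record, for each line, the time the actor finished speaking and the time the caption appeared, both in seconds from the start of the clip. The helper names, the sample timestamps, and the 2-second lateness threshold are placeholders for illustration only.

```python
from statistics import median


def caption_delays(spoken_end_times, display_times):
    """Per-line delay: when the caption appeared minus when the line ended."""
    if len(spoken_end_times) != len(display_times):
        raise ValueError("timestamp lists must be the same length")
    return [shown - spoken for spoken, shown in zip(spoken_end_times, display_times)]


def summarize_delays(delays, threshold_s):
    """Median delay, worst delay, and share of lines later than the threshold."""
    return {
        "median_s": median(delays),
        "max_s": max(delays),
        "share_late": sum(d > threshold_s for d in delays) / len(delays),
    }


# Made-up timestamps to show the calculation.
spoken = [2.0, 4.0, 7.5, 9.0]
shown = [2.5, 5.0, 8.0, 12.0]

delays = caption_delays(spoken, shown)
summary = summarize_delays(delays, threshold_s=2.0)  # placeholder threshold
print(summary)
```

The median describes typical delay, while the maximum and the late share reveal the occasional long stall that an average would hide.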

What Makes This Competitive

A stronger version of this project does more than build a demo. It compares multiple caption designs, measures both technical accuracy and audience response, and uses a careful study design. You could test how speaker tags, display location, or timing change comprehension across different scene types. Strong statistical analysis and a clear accessibility question would push the work toward a more competitive level.
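
As one example of the statistical side, suppose every participant watches one clip with speaker-tagged captions and one with standard captions, then takes a comprehension quiz after each. A paired comparison looks at each person's difference in score. The sketch below computes the mean difference, a paired t statistic, and a standardized effect size with Python's standard library; the scores are invented to show the arithmetic and are not results. Turning the t statistic into a p-value requires a t table or statistical software, and a mentor can advise whether a mixed-effects model fits your design better.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(scores_a, scores_b):
    """Mean difference, paired t statistic, and effect size for paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("each participant needs a score in both conditions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two participants")
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return {
        "n": n,
        "mean_diff": mean_diff,
        "t": mean_diff / (sd_diff / sqrt(n)),
        "effect_size_dz": mean_diff / sd_diff,
        "degrees_of_freedom": n - 1,
    }


# Invented quiz scores (out of 10) for five participants.
tagged_scores = [8, 7, 9, 6, 8]
standard_scores = [6, 6, 7, 5, 7]

result = paired_summary(tagged_scores, standard_scores)
for name, value in result.items():
    print(f"{name}: {value}")
```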

Project Variations

Test the system on musical theater clips instead of straight dialogue to see how singing changes caption accuracy and comprehension.
Compare face-based speaker attribution with audio-only diarization to find out which method better handles actors who are out of camera view or wearing masks.
Study whether captions help more when they appear on a seat-back display, a handheld device, or a projected screen near the stage.
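
If each participant sees all three display types, the viewing order can influence scores through practice or fatigue. One common control is to rotate the order across participants. The sketch below builds a simple rotation-based Latin square so that every display appears once in every position; the function names and participant IDs are hypothetical. This simple rotation balances position only and does not balance which display follows which.

```python
def latin_square_orders(conditions):
    """Rotate the condition list so each one appears once in each position."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participant_ids, conditions):
    """Cycle participants through the rotated orders."""
    orders = latin_square_orders(conditions)
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}


displays = ["seat-back", "handheld", "projected"]
participants = ["P01", "P02", "P03", "P04", "P05", "P06"]  # hypothetical IDs

orders = latin_square_orders(displays)
schedule = assign_orders(participants, displays)
for pid, order in schedule.items():
    print(pid, "->", ", ".join(order))
```

Recruiting participants in multiples of three keeps the orders evenly used.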

Learn More

MIT OpenCourseWare, Introduction to Digital Speech Processing: Search MIT OpenCourseWare for speech recognition and audio processing lectures.
NIH PubMed: Search review articles on captioning, accessibility, and audiovisual comprehension.
NIST Speech and Language Resources: Find evaluation methods and benchmarks for speech technologies.
pyannote.audio documentation: Read about speaker diarization methods and example workflows.
OpenCV documentation: Learn how to detect faces, track motion, and connect video frames to your captioning pipeline.
Human Factors journal: Search for articles on subtitle readability, accessibility, and user attention in live media.

Technology Enhances the Arts Category Guide

How to Do Real Technology Enhances the Arts Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
