Dialect Voice Conversion for Folk Tales Project

ISEF Category: Technology Enhances the Arts

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Human Information Exchange · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A voice can carry identity, rhythm, and place, not just words. That matters when you try to read a folk tale in a dialect that has very little training data. Your project asks a real question: can a small amount of audio teach a model to change the dialect while keeping the speaker sounding natural?

What Is It?

This project sits at the meeting point of speech tech and language preservation. RVC, which stands for retrieval-based voice conversion, tries to change how a voice sounds without changing the words. Whisper is a speech recognition model that can help you check what was said. Put together, these tools can help you test whether a voice model can make one speaker sound like they are speaking in a regional dialect while still keeping the same storytelling style.

Think of it like dubbing, but with a twist. You are not replacing the whole voice with a new one. You are asking the model to repaint the same voice with a different linguistic accent and rhythm. Prosody means the music of speech, like stress, pitch, and timing. If your model preserves prosody well, the story sounds more human and less robotic.
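One hedged way to put a number on "preserves prosody" is to compare the pitch contour of the source clip with the pitch contour of the converted clip. The sketch below is illustrative only. It assumes you already have per-frame pitch (F0) values in Hz from a pitch tracker of your choice, that unvoiced frames are marked with 0, and that both contours are aligned frame by frame. The function name `f0_correlation` is made up for this example.

```python
import math

def f0_correlation(source_f0, converted_f0):
    """Pearson correlation of log-F0 over frames that are voiced in both clips.

    Assumption: both lists hold F0 in Hz per frame, 0 means unvoiced,
    and the two contours are already time-aligned.
    """
    pairs = [(math.log(s), math.log(c))
             for s, c in zip(source_f0, converted_f0)
             if s > 0 and c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two shared voiced frames")
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a contour is flat, so correlation is undefined")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the converted clip rises and falls where the source does, even if the overall pitch moved up or down. Treat it as one objective measure to report next to listener ratings, not as a replacement for them.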

Why This Is a Good Topic

This is a strong science fair topic because you can test a clear input, a clear output, and measurable quality. You can vary training data size, dialect, speaker similarity, or audio preprocessing, then compare results with native-speaker ratings and objective speech measures. The project connects to language access, digital preservation, and speech technology, so it has a real-world purpose. You can also learn model training, evaluation design, and how to turn human judgments into data.

Research Questions

How does training audio length affect native-speaker mean opinion score (MOS) ratings for dialect-converted speech?
How does dialect distance affect preserved prosody in converted folk tale readings?
Does adding Whisper-based transcript filtering improve naturalness scores for low-resource dialect models?
To what extent does speaker gender matching change perceived authenticity in converted speech?
Which audio preprocessing method gives the best tradeoff between intelligibility and dialect naturalness?
How does text length in folk tale passages affect conversion quality and pronunciation stability?
What is the effect of using different reference speakers on similarity and listener ratings?
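If you pick the training audio length question, it helps to build your training sets so that each smaller set is contained in the next larger one. That way the only thing changing between conditions is the amount of audio. The sketch below is one possible way to do this. It assumes each clip is a `(clip_id, duration_sec)` pair in a fixed order you shuffled once ahead of time, and the name `nested_subsets` is hypothetical.

```python
def nested_subsets(clips, budgets_sec):
    """Build one training subset per duration budget.

    clips: list of (clip_id, duration_sec) in a fixed, pre-shuffled order.
    budgets_sec: total-audio budgets in seconds, one per condition.
    Each subset is a prefix of the clip list, so smaller subsets are
    always contained in larger ones.
    """
    subsets = {}
    for budget in sorted(budgets_sec):
        chosen, total = [], 0.0
        for clip_id, duration in clips:
            if total + duration > budget:
                break  # stop at the first clip that would exceed the budget
            chosen.append(clip_id)
            total += duration
        subsets[budget] = chosen
    return subsets
```

For example, with clips of 30, 40, 50, and 60 seconds and budgets of 60, 120, and 200 seconds, the 60-second condition uses only the first clip and the 200-second condition uses all four.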

Basic Materials

Laptop or desktop computer with a modern GPU or access to a school workstation.
Clean voice recordings in the target dialect, with speaker consent and simple metadata.
Reference speech recordings from the same speaker in a different language or dialect.
Headphones for listening tests.
Quiet room for recording and rating sessions.
Spreadsheet software for tracking ratings and trial conditions.
Digital audio editor such as Audacity for trimming and checking clips.
Online survey form for MOS ratings from native speakers.

Advanced Materials

Workstation with an NVIDIA GPU and enough VRAM for model training.
High-quality USB microphone or condenser microphone with audio interface.
Pop filter and microphone stand for consistent recordings.
Acoustic treatment for a quiet recording space.
Python environment with speech processing libraries.
Local storage for multiple model checkpoints and audio datasets.
Annotation tool for labeling clips by speaker, dialect, and cleanliness.
Statistical analysis software for mixed-effects models or rating analysis.

Software & Tools

Python: Runs your preprocessing, training scripts, and evaluation code.
Audacity: Edits audio clips, removes obvious noise, and checks clip length.
Whisper: Transcribes speech so you can verify the spoken content and filter bad clips.
LibreOffice Calc: Organizes your trial data, ratings, and summary tables.
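To make the Whisper filtering idea concrete, here is a minimal sketch that compares a transcript against the script the speaker was supposed to read, using word error rate (WER). It assumes you have already run Whisper on each clip and saved its output as plain text. The function names and the 0.2 cutoff are placeholders for illustration, not recommended values.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far (before this row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def keep_clip(reference, whisper_text, max_wer=0.2):
    """Assumed rule: keep a clip only if its WER is at or below max_wer."""
    return word_error_rate(reference, whisper_text) <= max_wer
```

Keep in mind that Whisper itself may struggle with a low-resource dialect, so a high WER can point to a recognition problem and not a bad recording. Spot-check rejected clips by ear before you throw them out.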

Experiment Steps

Define one speech task, one dialect target, and one evaluation metric before you train anything.
Collect or curate a small, clean dataset, then build rules for clip quality and speaker consent.
Plan a baseline model and at least one comparison condition, such as different training audio amounts or preprocessing methods.
Set up an evaluation plan that combines native-speaker MOS ratings with at least one objective speech measure.
Decide how you will check whether the model preserved wording, dialect features, and prosody, not just vocal timbre.
Pre-register your analysis choices, then run the same pipeline across every condition so comparisons stay fair.
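As a starting point for the evaluation plan in the steps above, the sketch below turns raw listener ratings into a mean opinion score with a rough confidence interval for each condition. It assumes ratings arrive as `(condition, rating)` pairs on a 1 to 5 scale, and it uses a normal approximation for the interval. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, stdev

def mos_summary(ratings, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return center, center - half_width, center + half_width

def mos_by_condition(rows, confidence=0.95):
    """rows: iterable of (condition, rating) pairs, one per listener judgment."""
    grouped = defaultdict(list)
    for condition, rating in rows:
        grouped[condition].append(rating)
    return {condition: mos_summary(values, confidence)
            for condition, values in grouped.items()}
```

This treats every rating as independent, which is not quite true when the same listener rates many clips. With a small panel, a t-based interval or the mixed-effects models mentioned under Advanced Materials would be a more defensible final analysis.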

Common Pitfalls

Using mismatched recording quality across dialect samples, which makes the model learn microphone noise instead of speech patterns.
Mixing speakers without strict labeling, which blurs voice identity and weakens conversion quality.
Choosing evaluation clips that are too short or too easy, which inflates MOS scores and hides failures.
Asking only one or two listeners for ratings, which gives unstable results and weak statistics.
Skipping transcript checks with Whisper or manual review, which lets mislabeled or off-target audio contaminate training.
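One simple guard against the first pitfall is to measure the loudness of every clip and flag the ones that sit far from the rest. The sketch below is a rough check, not a full audio quality audit. It assumes each clip has already been loaded as a list of float samples between -1 and 1, and the 6 dB tolerance and function names are illustrative choices.

```python
import math
from statistics import median

def rms_dbfs(samples):
    """RMS level in dB relative to full scale for samples in [-1, 1]."""
    if not samples:
        raise ValueError("clip has no samples")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms)

def flag_level_outliers(clip_levels, tolerance_db=6.0):
    """Return clip ids whose level is more than tolerance_db from the median.

    clip_levels: dict mapping clip_id to its level in dBFS.
    """
    typical = median(clip_levels.values())
    return sorted(clip_id for clip_id, level in clip_levels.items()
                  if abs(level - typical) > tolerance_db)
```

Loudness is only one symptom of mismatched recordings. Clips that pass this check can still differ in room echo or background hum, so listen to a sample from each recording session as well.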

What Makes This Competitive

A competitive version goes past “does it work” and asks “how well, for whom, and under what limits.” You need careful controls, like comparing audio length, dialect pairings, and preprocessing methods. Strong projects also separate listener bias from model quality by using blind ratings and solid statistics. If you connect human perception with objective speech metrics, your analysis becomes much stronger.
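Blind ratings are easier to run if the playlist is generated by code and not by hand. The sketch below shows one possible approach: shuffle the clips differently for each listener and hide the condition behind a neutral item label. It assumes clips are `(clip_path, condition)` pairs, and `build_blind_playlist` is a hypothetical helper name.

```python
import random

def build_blind_playlist(clips, listener_id, seed=0):
    """Return (playlist, key) for one listener.

    clips: list of (clip_path, condition) pairs.
    playlist: list of (blind_id, clip_path) in a listener-specific order.
    key: dict mapping blind_id back to the condition, kept by the researcher.
    """
    # Seeding with seed and listener id makes each order reproducible.
    rng = random.Random(f"{seed}-{listener_id}")
    order = list(clips)
    rng.shuffle(order)
    playlist, key = [], {}
    for index, (clip_path, condition) in enumerate(order, start=1):
        blind_id = f"item_{index:03d}"
        playlist.append((blind_id, clip_path))
        key[blind_id] = condition
    return playlist, key
```

Show listeners only the blind labels, and make sure the audio file names they can see do not give the condition away. Keep the key private until all ratings are collected.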

Project Variations

Test whether the same pipeline works better for tonal dialects than for non-tonal dialects.
Compare folk tale narration with everyday conversational speech to see which style converts more naturally.
Measure whether native-speaker listeners rate intelligibility and authenticity differently for male and female source voices.

Learn More

MIT OpenCourseWare Speech and Language Processing materials: Search MIT OpenCourseWare for lecture notes on speech processing and voice conversion.
NIH PubMed: Search for review articles on speech synthesis, voice conversion, and perceptual evaluation methods.
ACL Anthology: Find peer-reviewed papers on voice conversion, low-resource speech, and evaluation design.
NOAA National Centers for Environmental Information: Use background audio science resources if you need help understanding recording noise and signal quality.
NASA Open Data and software resources: Look for signal processing examples and Python workflow ideas that transfer to audio analysis.


