Wasserstein Dialect Boundary Detection

ISEF Category: Mathematics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

Languages change across space like colors blending on a map. Some shifts are smooth, and some form sharp borders. You can test where those borders appear in public atlas data. This project turns speech patterns into math you can measure.

What Is It?

This project asks a simple question with a math-heavy answer: where does one dialect region end and another begin? A dialect is a regional pattern in speech, such as a distinctive way of pronouncing a word or a local word for an everyday object. Public linguistic atlas data records these patterns across many locations, so you can treat each place like a point on a map with a speech profile attached.

Wasserstein distance, also called optimal transport distance, measures how much work it takes to turn one distribution into another. A good analogy is moving piles of sand. If two speech profiles are very similar, the distance stays small. If they differ a lot, the distance grows. That gives you a way to compare nearby locations and look for boundary lines where speech changes quickly.
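To make the sand-pile picture concrete, here is a minimal Python sketch. It assumes each speech profile is a histogram over a feature that can be placed on a one-dimensional scale. The town names, positions, and shares are invented for illustration and are not atlas data, and `wasserstein_1d` is a hypothetical helper name. For histograms on a line, `scipy.stats.wasserstein_distance` should give the same value when you pass the positions as values and the shares as weights.

```python
import numpy as np

def wasserstein_1d(profile_a, profile_b, positions):
    """1-D Wasserstein-1 distance between two histograms on shared, sorted positions."""
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    a, b = a / a.sum(), b / b.sum()                      # normalize to probabilities
    cdf_gap = np.abs(np.cumsum(a) - np.cumsum(b))[:-1]   # CDF gap on each interval
    widths = np.diff(np.asarray(positions, dtype=float)) # spacing between levels
    return float(np.sum(cdf_gap * widths))

# Hypothetical feature scale with four levels (assumption, not atlas data).
positions = [0.0, 1.0, 2.0, 3.0]

# Hypothetical share of respondents at each level in three towns.
town_a = [0.70, 0.20, 0.10, 0.00]
town_b = [0.60, 0.30, 0.10, 0.00]
town_c = [0.00, 0.10, 0.20, 0.70]

d_ab = wasserstein_1d(town_a, town_b, positions)
d_ac = wasserstein_1d(town_a, town_c, positions)
print(f"A vs B: {d_ab:.2f}")  # similar profiles
print(f"A vs C: {d_ac:.2f}")  # very different profiles
```

With these made-up numbers, the A-to-B distance is 0.10 and the A-to-C distance is 2.20, which matches the intuition that more sand has to move farther between A and C.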

Entropic regularization helps you compute Wasserstein distance faster and more smoothly. Regularization means adding a small adjustment that makes the math easier to solve; here the adjustment is an entropy penalty on the transport plan, and the trade-off is a slightly blurred answer. Your project can use that tool to build a significance test, which checks whether an apparent boundary is stronger than random noise or sampling variation.
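The sketch below shows one common way to solve the regularized problem, the Sinkhorn iteration, written in plain NumPy. It is an illustration under stated assumptions, not a tuned implementation: the profiles and positions are invented, `sinkhorn_plan` is a hypothetical helper name, and the two `eps` values are arbitrary choices. A library such as POT offers ready-made solvers that you would likely use in the real project.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps, n_iter=1000):
    """Entropy-regularized transport plan between histograms a and b."""
    K = np.exp(-cost / eps)        # Gibbs kernel: cheap moves get large weights
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)          # rescale so column sums match b
        u = a / (K @ v)            # rescale so row sums match a
    return u[:, None] * K * v[None, :]

# Hypothetical feature scale and ground cost, scaled so the largest cost is 1.
positions = np.array([0.0, 1.0, 2.0, 3.0])
cost = np.abs(positions[:, None] - positions[None, :])
cost = cost / cost.max()

# Hypothetical speech profiles for two towns (assumption, not atlas data).
town_a = np.array([0.70, 0.20, 0.10, 0.00])
town_c = np.array([0.00, 0.10, 0.20, 0.70])

costs = {}
for eps in (0.2, 2.0):             # small vs. large regularization strength
    plan = sinkhorn_plan(town_a, town_c, cost, eps)
    costs[eps] = float(np.sum(plan * cost))
    print(f"eps={eps}: transport cost {costs[eps]:.3f}")
```

A larger `eps` spreads mass more evenly across the plan, so the reported transport cost drifts upward from the unregularized value. That drift is the reason the regularization strength deserves its own sensitivity check.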

Why This Is a Good Topic

This is a strong science fair topic because you can ask a clear, testable question with public data and serious math. You do not need a wet lab, but you do need careful thinking about distance measures, spatial patterns, and statistical testing. The topic connects to real problems in geography, linguistics, and data science. You can learn how to clean data, compare distributions, and test whether a pattern is real or just an artifact of the method.

Research Questions

How does Wasserstein distance compare with simple mismatch counts for detecting dialect boundaries in atlas data?
What is the effect of entropic regularization strength on boundary sharpness and statistical significance?
Does adding geographic distance improve boundary detection compared with speech features alone?
To what extent do different sets of linguistic variables change the location of the strongest boundary?
Which atlas regions produce the largest Wasserstein jumps between neighboring locations?
How does the boundary map change when you resample locations or leave out sparse data points?

Basic Materials

Laptop or desktop computer with enough memory for data analysis.
Spreadsheet software or CSV editor for cleaning atlas data.
Python installed with NumPy, Pandas, SciPy, Matplotlib, and POT or another optimal transport library.
Public US linguistic atlas dataset or downloadable dialect survey data.
Map data or county boundary shapefiles for visualizing locations.
Notebook for tracking variable definitions, code decisions, and model choices.

Advanced Materials

Laptop or workstation with more RAM for repeated resampling runs.
Python with GeoPandas, Shapely, SciPy, NumPy, Pandas, Matplotlib, and an optimal transport package such as POT.
GIS software for spatial visualization, such as QGIS.
Access to cleaned historical dialect atlas tables and coordinate metadata.
Version control system such as Git for tracking code changes.
High-resolution regional boundary layers for spatial overlays.

Software & Tools

Python: Runs the distance calculations, resampling tests, and visualizations.
Jupyter Notebook: Keeps code, notes, and plots together while you explore the data.
Pandas: Cleans atlas tables and joins speech variables with locations.
Matplotlib: Draws boundary maps, distance plots, and test results.
QGIS: Helps you compare dialect boundaries with geographic regions on a map.

Experiment Steps

Define one speech feature set and one geographic region, so your question stays narrow and testable.
Choose how you will represent each location as a distribution or feature vector before you compute distance.
Design a baseline comparison method, so you can judge whether Wasserstein distance adds anything new.
Plan a boundary score that turns pairwise distances into a map of sharp changes across space.
Build a significance test with resampling or permutation logic that checks whether the boundary exceeds chance patterns.
Decide how you will report uncertainty, compare regions, and explain where the method works or fails.
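The boundary-score and significance-test steps above could be sketched as follows. This is one possible design, not the only valid one. It assumes you have respondent-level answers coded on an ordinal scale, it scores the edge between two neighboring locations by their 1-D Wasserstein distance, and it uses a two-sample permutation test as the null model. All data, town names, and function names here are hypothetical.

```python
import numpy as np

def w1_samples(x, y, levels):
    """1-D Wasserstein-1 distance between two samples of ordinal codes."""
    px = np.array([(x == k).mean() for k in levels])
    py = np.array([(y == k).mean() for k in levels])
    gap = np.abs(np.cumsum(px) - np.cumsum(py))[:-1]
    return float(np.sum(gap * np.diff(levels)))

def permutation_pvalue(x, y, levels, n_perm=2000, seed=0):
    """Share of random relabelings whose distance is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = w1_samples(x, y, levels)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)   # break the link between answer and town
        if w1_samples(shuffled[:len(x)], shuffled[len(x):], levels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

levels = np.array([0, 1, 2])                  # hypothetical ordinal variant codes

# Hypothetical respondents: 40 per town (invented counts, not atlas data).
west = np.repeat(levels, [30, 8, 2])
east = np.repeat(levels, [5, 10, 25])
west_neighbor = np.repeat(levels, [28, 9, 3])

score_sharp, p_sharp = permutation_pvalue(west, east, levels)
score_smooth, p_smooth = permutation_pvalue(west, west_neighbor, levels)
print(f"west vs east:          score={score_sharp:.3f}, p={p_sharp:.4f}")
print(f"west vs west_neighbor: score={score_smooth:.3f}, p={p_smooth:.4f}")
```

The design choice worth defending in a write-up is the null model. Shuffling respondents between two towns asks whether the towns differ at all; a spatial null that shuffles locations would ask a different question.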

Common Pitfalls

Using too many linguistic variables at once, which makes the boundary signal hard to interpret.
Ignoring missing atlas points, which can create fake jumps where the data are sparse.
Comparing raw distances without a baseline, which makes it impossible to tell whether Wasserstein adds value.
Choosing a regularization setting without checking sensitivity, which can blur or exaggerate boundary lines.
Treating map smoothness as proof of a real dialect border, which confuses visualization with statistical evidence.
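The baseline pitfall is easier to see with a toy comparison. The sketch below assumes that a simple mismatch baseline can be modeled as total variation distance, which is one reasonable choice among several, and that the feature levels sit on a 1-D scale. The profiles are invented and the function names are hypothetical.

```python
import numpy as np

def mismatch_distance(p, q):
    """Total variation: share of responses that differ, ignoring how far apart."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

def wasserstein_1d(p, q, positions):
    """1-D Wasserstein-1 distance between histograms on sorted positions."""
    gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(gap * np.diff(positions)))

positions = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical feature scale

# Hypothetical profiles: 60% of responses shift one level (near) or three levels (far).
base = np.array([0.8, 0.2, 0.0, 0.0])
near = np.array([0.2, 0.8, 0.0, 0.0])
far = np.array([0.2, 0.2, 0.0, 0.6])

tv_near, tv_far = mismatch_distance(base, near), mismatch_distance(base, far)
w_near, w_far = wasserstein_1d(base, near, positions), wasserstein_1d(base, far, positions)
print(f"mismatch:    near={tv_near:.2f}, far={tv_far:.2f}")
print(f"Wasserstein: near={w_near:.2f}, far={w_far:.2f}")
```

In this toy case the mismatch baseline scores both pairs at 0.60, while the Wasserstein distance separates them at 0.60 and 1.80. Whether that extra sensitivity helps on real atlas data is exactly what the baseline comparison should test.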

What Makes This Competitive

A stronger version of this project goes beyond making a pretty map. You would compare at least one standard boundary method against your Wasserstein approach, then test whether the result stays stable under resampling and missing-data stress tests. You could also show where entropic regularization helps or hurts detection. That kind of method comparison and uncertainty analysis reads like real research, not just a classroom demo.
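One way to run the resampling stability check is a bootstrap interval around a boundary score. The sketch below is a minimal version under assumptions: respondent-level ordinal codes, a percentile interval, and invented data. The function names and the 95% level are illustrative choices.

```python
import numpy as np

def w1_samples(x, y, levels):
    """1-D Wasserstein-1 distance between two samples of ordinal codes."""
    px = np.array([(x == k).mean() for k in levels])
    py = np.array([(y == k).mean() for k in levels])
    gap = np.abs(np.cumsum(px) - np.cumsum(py))[:-1]
    return float(np.sum(gap * np.diff(levels)))

def bootstrap_interval(x, y, levels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the distance between two locations."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)  # resample respondents
        yb = rng.choice(y, size=len(y), replace=True)
        stats[i] = w1_samples(xb, yb, levels)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

levels = np.array([0, 1, 2])                 # hypothetical ordinal variant codes
west = np.repeat(levels, [30, 8, 2])         # invented counts, not atlas data
east = np.repeat(levels, [5, 10, 25])

observed = w1_samples(west, east, levels)
lo, hi = bootstrap_interval(west, east, levels)
print(f"observed={observed:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

A boundary whose interval stays well above the scores of ordinary neighboring pairs is more convincing than one that only looks sharp on a single map.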

Project Variations

Use vowel pronunciation data instead of vocabulary choice data, then test whether the boundary map changes.
Compare county-level boundaries with state-level boundaries to see whether the method is sensitive to spatial scale.
Replace a single distance threshold with a full heat map of pairwise Wasserstein distances, then cluster the regions with the strongest differences.

Learn More

PubMed: Search for review articles on optimal transport in data analysis and spatial statistics to see how the distance measure works in research settings.
NOAA National Centers for Environmental Information: Explore geographic data handling and mapping resources for spatial visualization.
USGS National Map: Find boundary layers and coordinate data for overlaying your dialect maps.
MIT OpenCourseWare, Introduction to Probability and Statistics: Use free course notes and lectures to review hypothesis testing and resampling ideas.
Journal articles in Computational Linguistics and Linguistics: Search journal sites or Google Scholar for open-access papers on dialect maps, language variation, and spatial methods.

Mathematics Category Guide

How to Do Real Mathematics Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

