Berry-Esseen Bounds for Wage Inequality Studies

ISEF Category: Mathematics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Probability and Statistics  ·  Difficulty: Advanced  ·  Setup: Home Setup  ·  Time: Full Year

The Hook

Big data can still lie to you when the data have heavy tails. A few extreme wages can swing a summary statistic more than you expect. This project asks how quickly the sampling distribution of a statistic starts to look like a normal curve, even when the data are messy. You will test that idea with real census data and simulation.

What Is It?

A U-statistic is a summary number built by averaging a function, called the kernel, over all pairs or small groups of observations. Think of it like a team score instead of a solo score. Many statistics used in inequality studies, like rank-based comparisons and pairwise averages, fit this pattern.
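As a minimal sketch of the idea, the snippet below averages a pairwise function over every unordered pair of observations. The function names `u_statistic` and `abs_diff` and the four wage values are made up for illustration. With the absolute difference as the pairwise function, the result is the Gini mean difference, a standard pairwise measure of spread.

```python
from itertools import combinations


def u_statistic(sample, kernel):
    """Average a symmetric two-argument function over all unordered pairs."""
    pairs = list(combinations(sample, 2))
    if not pairs:
        raise ValueError("need at least two observations")
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)


def abs_diff(x, y):
    """Pairwise function for the Gini mean difference."""
    return abs(x - y)


# Illustrative hourly wages (invented numbers, not census data).
wages = [12.0, 15.0, 18.0, 95.0]
gmd = u_statistic(wages, abs_diff)
print(gmd)  # 42.0
```

Swapping in a different pairwise function changes the statistic without changing the averaging code. For example, `lambda x, y: (x - y) ** 2 / 2` reproduces the usual unbiased sample variance.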

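The Berry-Esseen bound gives an explicit upper limit on the largest gap between a statistic's sampling distribution and the normal curve at a fixed sample size n. In the classical version for sample means, that limit shrinks in proportion to 1/√n and requires a finite third absolute moment, so it is a practical error estimate, not just a vague long-run claim. A heavy-tailed kernel means the pairwise values being averaged can include rare but very large entries, like gaps involving extreme wages. These inflate the moments in the bound, or make them infinite, which makes the approximation harder. Your project studies how well the normal approximation works when the sample is finite and the tails are thick.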
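For reference, here is the classical statement for a sample mean, followed by the standardization usually used for a two-argument U-statistic. The symbols are notation chosen for this sketch: h is the pairwise function being averaged, and g is its one-variable projection. C is an absolute constant. Bounds of the same 1/√n order are known for U-statistics under moment conditions on h, but the exact conditions and constants vary by source, so check the paper you cite.

```latex
% Classical Berry-Esseen bound for i.i.d. X_1, ..., X_n with
% mean \mu, variance \sigma^2 > 0, and \rho = E|X_1 - \mu|^3 < \infty:
\sup_{x \in \mathbb{R}}
\left| P\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right|
\le \frac{C\,\rho}{\sigma^{3}\sqrt{n}}

% U-statistic with a symmetric two-argument function h:
U_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} h(X_i, X_j),
\qquad \theta = E\,h(X_1, X_2)

% Projection and the standardized statistic compared with \Phi:
g(x) = E\,h(x, X_2) - \theta,
\qquad \sigma_g^{2} = \operatorname{Var} g(X_1) > 0,
\qquad \frac{\sqrt{n}\,(U_n - \theta)}{2\,\sigma_g}
```

When the third moment ρ is infinite, the right-hand side of the first inequality is infinite and the bound says nothing. That is the situation heavy tails can create.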

Why This Is a Good Topic

This topic is a strong science fair choice because you can test real mathematics with real data. You can compare theory, simulation, and bootstrap results, which gives you more than one way to measure success. The project connects to wage inequality, a real social issue, so your math has a clear use. You can also scale it to your level, from simulation only to simulation plus public IPUMS data.

Research Questions

  • How does sample size affect the accuracy of the normal approximation for a heavy-tailed U-statistic?
  • What is the effect of tail heaviness on the Berry-Esseen error bound for a wage-related U-statistic?
  • Does trimming extreme wage values improve the finite-sample bound without changing the statistic too much?
  • To what extent do bootstrap confidence intervals match the theoretical normal approximation on IPUMS wage data?
  • Which U-statistic kernel gives the most stable estimate of wage inequality under heavy-tailed data?
  • How does the bound change when you compare different subgroups defined by education, sex, or age?

Basic Materials

  • Laptop with spreadsheet software or Python access.
  • Public IPUMS census data account and downloaded extract.
  • Calculator or spreadsheet for summary statistics.
  • Python or R installed on a personal computer.
  • Graph paper or note-taking app for planning variables.
  • Basic statistics reference, such as an intro probability textbook or lecture notes.

Advanced Materials

  • Access to IPUMS microdata extract files.
  • Python with NumPy, pandas, SciPy, and statsmodels.
  • R with boot package or related resampling tools.
  • A scriptable environment such as Jupyter Notebook.
  • Optional high-performance laptop or campus computing access for large bootstrap runs.
  • Access to a statistics textbook or graduate lecture notes on U-statistics.

Software & Tools

  • Python: Runs simulations, computes U-statistics, and compares empirical error to theory.
  • R: Handles resampling workflows and makes statistical plots.
  • Jupyter Notebook: Keeps code, notes, and figures in one place.
  • pandas: Organizes IPUMS microdata and subgroup filters.
  • ImageJ: Not needed for this topic. Skip it unless it is already part of your poster figure workflow.

Experiment Steps

  1. Define the exact U-statistic you want to study and write down why it measures wage inequality.
  2. Choose the one feature you will treat as heavy-tailed, such as wage, log wage, or a transformed wage gap.
  3. Set up a simulation plan that compares small, medium, and large sample sizes.
  4. Build a baseline normal approximation and decide how you will measure error against simulated sampling distributions.
  5. Plan a bootstrap comparison so you can check whether resampling agrees with the normal approximation and whether the observed error stays within the theoretical bound.
  6. Design subgroup checks on IPUMS data so you can see whether the bound changes across populations.
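Steps 3 and 4 above can be sketched as a small simulation. This is one possible setup, not the required design. It assumes the statistic is the Gini mean difference, uses Pareto draws as a stand-in for heavy-tailed wages, and standardizes the simulated statistics by their own mean and standard deviation, so it measures only how non-normal the shape is. The function names, tail indices, and repetition count are illustrative choices.

```python
import numpy as np
from scipy import stats


def gini_mean_difference_rows(samples):
    """Gini mean difference of each row, using the sorted-data identity
    sum_{i<j} (x_(j) - x_(i)) = sum_i (2i - n - 1) * x_(i)."""
    s = np.sort(np.asarray(samples, dtype=float), axis=1)
    n = s.shape[1]
    weights = 2 * np.arange(1, n + 1) - n - 1
    return 2.0 * (s @ weights) / (n * (n - 1))


def simulate_kolmogorov_distance(n, tail_index, reps=2000, seed=0):
    """Kolmogorov distance between the standardized simulated statistic
    and the standard normal, for Pareto data with the given tail index."""
    rng = np.random.default_rng(seed)
    # Generator.pareto draws Pareto II (Lomax); adding 1 gives a classical
    # Pareto with minimum value 1. Smaller tail_index means heavier tails.
    samples = rng.pareto(tail_index, size=(reps, n)) + 1.0
    u_values = gini_mean_difference_rows(samples)
    # Assumption: standardize with the simulated mean and spread rather
    # than the theoretical ones, so only the shape is compared.
    z = (u_values - u_values.mean()) / u_values.std(ddof=1)
    return stats.kstest(z, "norm").statistic


for n in (20, 80, 320):
    print(n, simulate_kolmogorov_distance(n, tail_index=2.5))
```

Repeating the loop over several tail indices and several seeds gives a table of distances by sample size and tail heaviness. Averaging over seeds avoids the pitfall of judging theory against a single random run.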

Common Pitfalls

  • Using raw wage values with no cleaning, which lets impossible or top-coded values dominate the statistic.
  • Mixing up the kernel definition and the final U-statistic, which makes the bound meaningless.
  • Comparing theory to one random simulation run, which hides the true sampling error.
  • Ignoring how IPUMS survey weights or subgroup filters change the data distribution.
  • Treating a bootstrap result as proof of the bound instead of one check on the approximation.
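To make the last pitfall concrete, here is a hedged sketch of a bootstrap check that puts a percentile interval next to a normal-approximation interval for the same statistic. It assumes the Gini mean difference as the statistic, uses synthetic lognormal values in place of real wages, and ignores survey weights entirely. The names `bootstrap_intervals` and `synthetic_wages` are hypothetical. Agreement between the two intervals is one piece of evidence about the approximation, not a proof of any bound.

```python
import numpy as np
from scipy import stats


def gini_mean_difference(x):
    """Mean absolute difference over all pairs, computed from sorted data."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    weights = 2 * np.arange(1, n + 1) - n - 1
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def bootstrap_intervals(x, n_boot=2000, level=0.95, seed=0):
    """Percentile and normal-approximation intervals from one bootstrap run."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    estimate = gini_mean_difference(x)
    # Resample row indices with replacement (unweighted: an assumption).
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boot = np.array([gini_mean_difference(x[row]) for row in idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    z = stats.norm.ppf(1 - alpha / 2)
    se = boot.std(ddof=1)  # bootstrap standard error
    return {
        "estimate": estimate,
        "percentile": (lower, upper),
        "normal": (estimate - z * se, estimate + z * se),
    }


# Synthetic stand-in for wages; replace with a cleaned IPUMS column.
rng = np.random.default_rng(1)
synthetic_wages = rng.lognormal(mean=3.0, sigma=0.8, size=200)
result = bootstrap_intervals(synthetic_wages)
print(result)
```

Resampling with replacement can draw the same observation twice, which creates zero-difference pairs and pulls the bootstrap values slightly downward. A large gap between the two intervals is a sign that skewness or heavy tails are still affecting the sampling distribution at that sample size.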

What Makes This Competitive

A stronger project will not stop at one plot. You can compare several kernels, several subgroup choices, and several tail treatments, then report where the approximation succeeds or fails. You can also quantify error with more than one metric, such as Kolmogorov distance, coverage error, or standardized residuals. A careful study that combines theory, simulation, and public data analysis can read like a small research paper, not a classroom demo.

Project Variations

  • Study how the bound changes when you use wage, log wage, or wage rank as the kernel input.
  • Compare men, women, and different age groups to see whether tail behavior changes the approximation quality.
  • Replace the census outcome with another public IPUMS variable, such as hours worked or family income, and test the same bound.

Learn More

  • IPUMS USA: Search for public microdata extracts and documentation on wage variables, weights, and top-coding rules.
  • NIH PubMed: Search for review articles on U-statistics, Berry-Esseen bounds, and heavy-tailed sampling.
  • MIT OpenCourseWare: Find probability and statistics lecture notes that cover asymptotic normality and resampling.
  • Annals of Statistics: Search for peer-reviewed papers on finite-sample bounds, U-statistics, and bootstrap theory.
  • NIST Engineering Statistics Handbook: Read the sections on distributions, bootstrap methods, and interpreting sampling error.
  • Census Bureau: Use public documentation for income and wage definitions, survey design, and variable coding.
