Fair Ranking Calibration for ML Models

ISEF Category: Mathematics

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Probability and Statistics  ·  Difficulty: Advanced  ·  Setup: Home Setup  ·  Time: 1 to 2 Months

The Hook

A model can look fair and still rank people in a biased order. That matters when a score decides who gets a loan, a review, or extra scrutiny. You can study how a fairness metric changes the ranking itself, not just the final yes or no answer. That gives you a project that feels real, because the stakes are real.

What Is It?

This topic asks how you measure fairness when a machine learning model sorts people by score. Think of it like a class ranking system. A normal grade tells you who got what, but a ranking tells you who lands above or below everyone else. In the same way, a fairness metric for ranking checks whether the order produced by a model treats groups in a balanced way.

The key idea is calibration. In simple terms, calibration means a score matches reality: among people who receive a score near 0.7, about 70% should actually have the outcome. Calibration within groups asks for the same match in every group, so people from two groups who receive the same score should have similar outcome rates. A fair-ranking calibration metric tries to keep that promise while also checking the order of predictions. Your project would study whether the metric behaves the way the theory says it should, then test it on public datasets such as FICO-style credit data or COMPAS-like recidivism data.
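To make the calibration idea concrete, here is a minimal sketch of a per-group calibration check in plain Python. It is a generic binned comparison, not the specific metric from any paper: the function names `calibration_by_group` and `max_group_gap`, the equal-width bins, and the toy data are all assumptions chosen for illustration, and scores are assumed to lie between 0 and 1.

```python
from collections import defaultdict

def calibration_by_group(scores, outcomes, groups, n_bins=5):
    """Return {group: {bin_index: (mean_score, observed_rate, count)}}.

    Assumes every score is in [0, 1] and every outcome is 0 or 1.
    """
    cells = defaultdict(list)  # (group, bin) -> list of (score, outcome)
    for s, y, g in zip(scores, outcomes, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        cells[(g, b)].append((s, y))

    table = defaultdict(dict)
    for (g, b), rows in cells.items():
        n = len(rows)
        mean_score = sum(s for s, _ in rows) / n
        observed_rate = sum(y for _, y in rows) / n
        table[g][b] = (mean_score, observed_rate, n)
    return dict(table)

def max_group_gap(table):
    """Largest difference in observed outcome rate between groups in a shared bin."""
    all_bins = {b for per_bin in table.values() for b in per_bin}
    gaps = []
    for b in all_bins:
        rates = [per_bin[b][1] for per_bin in table.values() if b in per_bin]
        if len(rates) >= 2:  # only compare bins that more than one group occupies
            gaps.append(max(rates) - min(rates))
    return max(gaps) if gaps else 0.0

# Toy data (made up): both groups receive the same scores,
# but group B's outcomes do not track its scores.
scores   = [0.10, 0.15, 0.90, 0.85, 0.10, 0.15, 0.90, 0.85]
outcomes = [0,    0,    1,    1,    0,    1,    1,    0]
groups   = ["A",  "A",  "A",  "A",  "B",  "B",  "B",  "B"]

table = calibration_by_group(scores, outcomes, groups, n_bins=2)
gap = max_group_gap(table)
print(table)
print("largest within-bin gap:", gap)
```

In this toy example the two groups have identical scores, yet the outcome rates inside each score bin differ, which is the kind of mismatch a calibration check is meant to expose. With real data you would want far more rows per bin before trusting any single rate.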

Why This Is a Good Topic

This is a strong science fair topic because you can turn a math idea into a measurable test. You can compare how different ranking rules change fairness, accuracy, and calibration. The project connects to real problems in lending, hiring, and criminal justice, so your results feel meaningful. You can also learn real research skills, like reading papers, cleaning data, and running statistical comparisons.

Research Questions

  • How does the fair-ranking calibration metric change when you cut the ranked predictions at different score thresholds?
  • What is the effect of group imbalance on the value of the fairness metric?
  • Does the metric stay stable when you bootstrap the dataset with repeated resampling?
  • To what extent do different models, such as logistic regression and random forest, produce different fairness scores?
  • Which ranking cutoff gives the largest gap between calibration and fairness?
  • How does removing one feature at a time change the metric on public credit and recidivism datasets?
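The cutoff questions above can be explored with a simple sweep. The sketch below ranks everyone by score, keeps the top k, and reports how unevenly the groups are represented at each cutoff. The selection-rate gap used here is a stand-in chosen for illustration, not the fair-ranking calibration metric itself, and the names `top_k_share` and `selection_gap_by_cutoff` and the toy data are assumptions.

```python
def top_k_share(scores, groups, k):
    """Fraction of each group's members that land in the top k by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    totals, selected = {}, {}
    for i, g in enumerate(groups):
        totals[g] = totals.get(g, 0) + 1
        selected[g] = selected.get(g, 0) + (1 if i in top else 0)
    return {g: selected[g] / totals[g] for g in totals}

def selection_gap_by_cutoff(scores, groups, cutoffs):
    """For each cutoff k, the spread between the most and least selected group."""
    result = {}
    for k in cutoffs:
        shares = top_k_share(scores, groups, k)
        result[k] = max(shares.values()) - min(shares.values())
    return result

# Toy ranking (made up): group A is concentrated near the top of the list.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
groups = ["A",  "A",  "B",  "A",  "B",  "B",  "A",  "B"]

gaps = selection_gap_by_cutoff(scores, groups, cutoffs=[2, 4, 6, 8])
for k, gap in gaps.items():
    print(f"top {k}: selection-rate gap = {gap:.2f}")
```

Here the gap is large for short lists and disappears once the cutoff reaches six, which shows why a single cutoff can hide or exaggerate a ranking problem. Swapping in your own metric only requires replacing the gap calculation inside the loop.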

Basic Materials

  • Laptop or desktop computer with at least 8 GB RAM.
  • Python installed through Anaconda or a similar free distribution.
  • Jupyter Notebook for writing and running code.
  • A public dataset, such as FICO-style credit data or the COMPAS recidivism benchmark.
  • Spreadsheet software for quick checks and tables.
  • Notes document for tracking definitions, variables, and experiment choices.

Advanced Materials

  • Access to a university or school server for repeated model runs.
  • Python with pandas, NumPy, SciPy, scikit-learn, and statsmodels.
  • Jupyter Notebook or JupyterLab for analysis notebooks.
  • Git for version control and reproducibility tracking.
  • High-resolution monitor for comparing plots and calibration curves.
  • Optional access to R for cross-checking statistical results.

Software & Tools

  • Python: Runs data cleaning, model training, and fairness metric calculations.
  • Jupyter Notebook: Lets you document each analysis step beside the code.
  • pandas: Organizes the dataset and prepares features for testing.
  • scikit-learn: Builds baseline prediction models for comparison.
  • matplotlib: Makes plots for calibration curves, score distributions, and group comparisons.

Experiment Steps

  1. Define the fairness claim you want to test, then decide what counts as a ranking output in your dataset.
  2. Choose one public dataset and map each column to a clear meaning before you model anything.
  3. Translate the axioms from the paper that defines your chosen metric into measurable checks so you can see whether the metric behaves as promised.
  4. Build at least one baseline model, then compare its ranking behavior against the new metric.
  5. Plan a fairness analysis that separates overall accuracy from group-level calibration, so you do not mix the two.
  6. Set up resampling or cross-validation, then compare whether your results stay consistent across splits.
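The resampling step can be sketched as a percentile bootstrap. The code below is a rough illustration under stated assumptions: `bootstrap_interval` is a hypothetical helper, `outcome_rate_gap` is a deliberately simple stand-in for whatever fairness metric you implement, and the rows are made-up (group, outcome) pairs rather than real data.

```python
import random

def bootstrap_interval(rows, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(rows), resampling rows with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(rows)
    values = []
    for _ in range(n_resamples):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        values.append(metric(sample))
    values.sort()
    lower = values[int((alpha / 2) * n_resamples)]
    upper = values[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper

def outcome_rate_gap(rows):
    """Stand-in metric (an assumption): |outcome rate of A - outcome rate of B|."""
    rates = {}
    for g in ("A", "B"):
        ys = [y for grp, y in rows if grp == g]
        if not ys:
            return 0.0  # a resample can miss a small group entirely; handle as you see fit
        rates[g] = sum(ys) / len(ys)
    return abs(rates["A"] - rates["B"])

# Made-up rows: group A has a 60% outcome rate, group B has 40%.
rows = [("A", 1)] * 30 + [("A", 0)] * 20 + [("B", 1)] * 20 + [("B", 0)] * 30

point = outcome_rate_gap(rows)
lower, upper = bootstrap_interval(rows, outcome_rate_gap, n_resamples=500, seed=42)
print(f"point estimate: {point:.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

A wide interval suggests the metric is too unstable on this dataset to support a strong claim. If you also retrain the model inside each resample, the interval will reflect model variability as well as sampling variability, at the cost of longer run times.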

Common Pitfalls

  • Using a dataset without checking how the labels were generated, which can make your fairness results hard to interpret.
  • Treating a score ranking as the same thing as a binary classification decision, which hides the actual metric behavior.
  • Comparing group averages without checking score calibration first, which can make a bad ranking look fair.
  • Testing only a single train/test split, which makes the results too unstable for a strong conclusion.
  • Quoting the source paper’s theory without reproducing the metric carefully, which can lead to a mismatch between your code and the definition.
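The second pitfall, treating a ranking like a yes-or-no decision, is easy to see with a small example. The sketch below compares a thresholded decision with a pairwise ordering check across groups: how often a person with a positive outcome in one group is ranked above a person with a negative outcome in the other group. This pairwise measure is one common way to look at ranking quality, used here as an assumption rather than as the metric from any particular paper, and the function names and toy data are hypothetical.

```python
def pairwise_ordering_rate(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def cross_group_rates(scores, outcomes, groups, a, b):
    """Ordering rate of a's positives over b's negatives, and the reverse."""
    def pick(group, outcome):
        return [s for s, y, g in zip(scores, outcomes, groups)
                if g == group and y == outcome]
    return {
        (a, b): pairwise_ordering_rate(pick(a, 1), pick(b, 0)),
        (b, a): pairwise_ordering_rate(pick(b, 1), pick(a, 0)),
    }

# Toy data (made up).
scores   = [0.90, 0.60, 0.30, 0.80, 0.55, 0.20]
outcomes = [1,    1,    0,    0,    1,    0]
groups   = ["A",  "A",  "A",  "B",  "B",  "B"]

# Binary view: count how many people in each group clear a 0.5 threshold.
threshold = 0.5
selected = {g: sum(1 for s, grp in zip(scores, groups) if grp == g and s >= threshold)
            for g in ("A", "B")}

# Ranking view: compare orderings across groups.
rates = cross_group_rates(scores, outcomes, groups, "A", "B")
print("selected per group:", selected)
print("cross-group ordering rates:", rates)
```

Both groups have the same number of people above the threshold, so the binary view looks balanced, yet the two cross-group ordering rates differ. That difference is invisible if you only examine the final decisions.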

What Makes This Competitive

A stronger version of this project would do more than repeat a published example. You could test whether the metric stays consistent across several datasets, models, and resampling methods. You could also compare it against other fairness measures and show where each one agrees or disagrees. That kind of careful analysis shows real mathematical judgment.

Project Variations

  • Apply the same fairness metric to different public credit datasets and compare whether the ranking behavior changes.
  • Swap in a different classifier, such as logistic regression, random forest, or gradient boosting, then see how the metric responds.
  • Test whether the metric changes more under class imbalance or under feature removal, then report which source of change matters most.

Learn More

  • PubMed: Search for review articles on algorithmic fairness, calibration, and bias in machine learning.
  • arXiv: Search for recent preprints on fair ranking, calibration, and group fairness metrics.
  • MIT OpenCourseWare: Look for probability, statistics, and machine learning course notes that cover model evaluation.
  • scikit-learn User Guide: Read the free documentation for model training, cross-validation, and performance metrics.
  • U.S. National Institute of Standards and Technology (NIST): Search for guidance on algorithmic bias, measurement, and evaluation frameworks.
