ML Paper Reproducibility Checker Project

ISEF Category: Systems Software

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A lot of ML papers look solid until someone tries to rerun them. Then the numbers change, the code breaks, or the model never matches the paper. Your project can test that gap directly. You can build a system that checks whether a paper really works outside the authors' notebook.

What Is It?

This project is about reproducibility, which means another person can run the same method and get similar results. In machine learning, that sounds simple, but it often is not. A paper may claim a result, while the code, data links, random seeds, or runtime setup hide the real difficulty. Your checker would try to catch that mismatch.

Think of it like a recipe scanner for research papers. If the paper says, "Bake for 20 minutes," but leaves out the oven temperature, the recipe is incomplete. Your system would read the paper, find the linked code, try to rerun the declared experiments in a sandbox like Colab, and then score how close the rerun came to the reported result. The public leaderboard part adds a ranking layer, so papers can be compared by how reproducible they are, not just by how high their accuracy looks.
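
One way to turn "how close the rerun came" into a number is a relative-gap score. The sketch below is only an illustration: the name `match_score` and the 5% default tolerance are assumptions, and you would want to justify your own tolerance before collecting data.

```python
def match_score(reported: float, rerun: float, tolerance: float = 0.05) -> float:
    """Score in [0, 1]: 1.0 for an exact match, falling linearly to 0.0
    once the relative gap between rerun and reported reaches `tolerance`."""
    if reported == 0:
        raise ValueError("reported value must be non-zero for a relative gap")
    relative_gap = abs(rerun - reported) / abs(reported)
    return max(0.0, 1.0 - relative_gap / tolerance)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real paper.
    print(match_score(reported=0.912, rerun=0.905))
```

A linear falloff is easy to explain to judges, but it treats a small gap on a saturated benchmark the same as a small gap on a hard one, so it is worth discussing that limit in your write-up.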

Why This Is a Good Topic

This is a strong science fair topic because it asks a clear question with real software methods. You can measure whether a paper's claims match its code and rerun results, which gives you concrete outputs instead of vague opinions. The topic connects to a real problem in science and tech, since poor reproducibility wastes time, money, and trust. You can also learn source parsing, automation, sandboxing, evaluation metrics, and experiment design.

Research Questions

  • How does the type of paper task, such as classification, detection, or generation, affect rerun success?
  • What is the effect of missing or incomplete code comments on reproducibility score?
  • Does adding a fixed environment file improve the match between reported and rerun results?
  • To what extent do papers with public datasets reproduce better than papers with private or manually collected data??
  • Which paper features, such as hyperparameter detail, code structure, or seed control, best predict leaderboard score??
  • How does sandbox choice, such as Colab versus local runtime, change the fraction of papers that rerun successfully??
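
Several of these questions compare rerun success between two groups of papers. A minimal sketch of that analysis, assuming you record each paper as a (group, succeeded) pair, is shown here. The function names are made up for illustration, and Fisher's exact test is just one reasonable choice for small samples, not a requirement of the project.

```python
from collections import Counter
from math import comb


def count_outcomes(records):
    """Tally (group, succeeded) pairs into {group: (successes, failures)}."""
    tally = Counter()
    for group, succeeded in records:
        tally[(group, bool(succeeded))] += 1
    groups = {group for group, _ in tally}
    return {g: (tally[(g, True)], tally[(g, False)]) for g in groups}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


if __name__ == "__main__":
    # Made-up outcomes, not real measurements.
    records = [("public", True)] * 3 + [("private", False)] * 3
    counts = count_outcomes(records)
    print(counts, fisher_exact_two_sided(*counts["public"], *counts["private"]))
```

With only a handful of papers per group the p-value will rarely be small, which is a useful argument for planning your sample size early.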

Basic Materials

  • A laptop or desktop computer with a modern web browser.
  • Free Google Colab access.
  • Python 3.10 or newer.
  • Git and a GitHub account.
  • Text editor or code editor such as VS Code.
  • Spreadsheet software for tracking paper metadata and rerun results.
  • Access to arXiv search and paper PDFs.
  • Optional: a citation manager to organize sources. PubMed-style literature tracking is not needed for this project.

Advanced Materials

  • University workstation or access to a stronger GPU server.
  • Docker for environment isolation.
  • Hugging Face datasets or similar public ML datasets.
  • PyTorch or TensorFlow, depending on the target papers.
  • API access to arXiv metadata and GitHub repositories.
  • Container scanning tools for dependency inspection.
  • Database system for storing paper features and scores.
  • Logging and monitoring tools for failed reruns and runtime errors.

Software & Tools

  • Python: Automates paper retrieval, parsing, reruns, and scoring logic.
  • arXiv API: Pulls metadata, abstracts, and source links for candidate papers.
  • GitHub API: Finds linked repositories, commits, and repository metadata.
  • Google Colab: Provides a free sandbox for trial reruns on a limited compute tier.
  • Jupyter Notebook: Helps you inspect failures, compare outputs, and document results.
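
To make the first two tools concrete: the arXiv API returns results as an Atom XML feed, so Python's standard library can pull out titles and look for GitHub links. The sketch below parses a made-up sample feed so it runs offline. The helper names, the placeholder entry, and the assumption that the repository link appears in the abstract are all illustrative, and many real papers link their code elsewhere.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
GITHUB_REPO = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")

# Placeholder feed in the Atom format; not a real paper or repository.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Placeholder Title</title>
    <summary>Code: https://github.com/example-user/example-repo.</summary>
  </entry>
</feed>"""


def find_repos(text: str) -> list[str]:
    """Return sorted 'owner/repo' strings for GitHub links found in text."""
    repos = set()
    for owner, repo in GITHUB_REPO.findall(text):
        # Drop sentence-ending periods and a trailing ".git".
        repo = repo.rstrip(".").removesuffix(".git")
        if repo:
            repos.add(f"{owner}/{repo}")
    return sorted(repos)


def parse_entries(feed_xml: str) -> list[dict]:
    """Extract id, title, and linked repositories from each Atom entry."""
    root = ET.fromstring(feed_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": " ".join(title.split()),  # collapse line breaks in titles
            "repos": find_repos(summary),
        })
    return papers


if __name__ == "__main__":
    print(parse_entries(SAMPLE_FEED))
```

In a real pipeline you would replace `SAMPLE_FEED` with the text fetched from the arXiv query endpoint, then pass each `owner/repo` string to the GitHub API to read commit and repository metadata.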

Experiment Steps

  1. Define the reproducibility score you will measure, such as rerun success, metric match, or setup completeness.
  2. Choose a narrow paper set first, so you can test the pipeline on a manageable sample.
  3. Build a parser that collects paper metadata, linked code, and reported experimental claims.
  4. Design a sandbox workflow that records whether each experiment can run without manual rescue.
  5. Create a scoring rubric that separates code availability, environment clarity, and result match.
  6. Plan a leaderboard format that ranks papers by reproducibility and explains why each score changed.
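
Steps 5 and 6 can be sketched together as a weighted rubric plus a sorted leaderboard. Everything here is an assumption for illustration: the `PaperScore` class, the three component names, and especially the weights, which you should choose and defend yourself.

```python
from dataclasses import dataclass

# Hypothetical weights; pick and justify your own before collecting data.
WEIGHTS = {
    "code_availability": 0.3,
    "environment_clarity": 0.3,
    "result_match": 0.4,
}


@dataclass
class PaperScore:
    paper_id: str
    code_availability: float    # each component is scored from 0.0 to 1.0
    environment_clarity: float
    result_match: float

    def total(self) -> float:
        """Weighted sum of the three rubric components."""
        return sum(w * getattr(self, name) for name, w in WEIGHTS.items())

    def explain(self) -> str:
        """One-line breakdown so a reader can see why the score is what it is."""
        parts = [f"{name}={getattr(self, name):.2f} (weight {w})"
                 for name, w in WEIGHTS.items()]
        return f"{self.paper_id}: total={self.total():.2f}; " + ", ".join(parts)


def leaderboard(scores):
    """Return scores ordered from most to least reproducible."""
    return sorted(scores, key=lambda s: s.total(), reverse=True)


if __name__ == "__main__":
    # Placeholder papers with made-up component scores.
    entries = [PaperScore("paper-B", 1.0, 1.0, 0.2),
               PaperScore("paper-A", 1.0, 0.5, 0.8)]
    for row in leaderboard(entries):
        print(row.explain())
```

Keeping the components separate until the final sum means the leaderboard can always show which part of the rubric moved a paper up or down.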

Common Pitfalls

  • Trying to rerun papers with missing datasets, which makes failures look like model problems when the real issue is data access.
  • Comparing your rerun to the paper without matching the same metric, which gives false disagreement.
  • Letting one paper use hidden manual fixes, which ruins the fairness of the leaderboard.
  • Mixing papers from very different ML tasks in one score, which makes the ranking hard to interpret.
  • Treating a successful repository clone as proof of reproducibility, even when the repository does not match the version used in the paper.
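
The metric-matching pitfall is one you can guard against in code. The sketch below normalizes metric names and units before any comparison and refuses to compare different metrics. The alias table and the rule that values above 1 are percentages are assumptions for illustration, so check them against the papers you actually sample.

```python
# Illustrative aliases only; extend from the metric names you actually see.
ALIASES = {"acc": "accuracy", "top 1 accuracy": "accuracy", "f1 score": "f1"}
# Metrics assumed to live in [0, 1]; larger values are treated as percentages.
UNIT_INTERVAL_METRICS = {"accuracy", "f1"}


def normalize_metric(name: str, value: float) -> tuple[str, float]:
    """Map a metric name to a canonical key and put its value on a 0..1 scale."""
    key = " ".join(name.lower().replace("-", " ").replace("_", " ").split())
    key = ALIASES.get(key, key)
    if key in UNIT_INTERVAL_METRICS and value > 1.0:
        value = value / 100.0
    return key, value


def comparable(reported: tuple[str, float], rerun: tuple[str, float]) -> tuple[float, float]:
    """Return (reported_value, rerun_value) on the same scale, or raise
    ValueError if the two results use different metrics."""
    reported_name, reported_value = normalize_metric(*reported)
    rerun_name, rerun_value = normalize_metric(*rerun)
    if reported_name != rerun_name:
        raise ValueError(f"metric mismatch: {reported_name!r} vs {rerun_name!r}")
    return reported_value, rerun_value


if __name__ == "__main__":
    # Made-up values: the paper reports a percentage, the rerun logs a fraction.
    print(comparable(("Top-1 Accuracy", 91.2), ("acc", 0.905)))
```

Raising an error on a mismatch, instead of silently scoring it as a failed reproduction, keeps metric confusion out of your leaderboard numbers.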

What Makes This Competitive

A competitive version goes beyond a simple success or fail list. You can build a clear scoring system, separate different failure modes, and compare paper types with real statistics. Strong projects often show that certain documentation patterns predict rerun success, or that one sandboxing method catches more problems than another. If you add careful error logging and a fair leaderboard design, your project starts to look like a real research tool, not just a script.
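
Careful error logging can start very small. The sketch below runs one experiment command with a time limit and returns a structured record of what happened. It is an assumption-laden starting point: the `run_experiment` name and record fields are made up, and the function does not isolate anything by itself, so you would still call it from inside Colab or a Docker container.

```python
import subprocess
import sys
import time


def run_experiment(command: list[str], timeout_s: float = 60.0) -> dict:
    """Run a command without manual intervention and record the outcome."""
    start = time.monotonic()
    record = {"command": command, "returncode": None,
              "stdout_tail": "", "stderr_tail": ""}
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
        record["status"] = "success" if proc.returncode == 0 else "failed"
        record["returncode"] = proc.returncode
        # Keep only the end of each stream so log files stay small.
        record["stdout_tail"] = proc.stdout[-2000:]
        record["stderr_tail"] = proc.stderr[-2000:]
    except subprocess.TimeoutExpired:
        record["status"] = "timeout"
    except FileNotFoundError as exc:
        # The executable itself is missing, so nothing could be started.
        record["status"] = "not_runnable"
        record["stderr_tail"] = str(exc)
    record["seconds"] = round(time.monotonic() - start, 3)
    return record


if __name__ == "__main__":
    # Stand-in for a real training or evaluation script.
    print(run_experiment([sys.executable, "-c", "print('ok')"]))
```

Saving one record per rerun, for example as a line of JSON, gives you the raw data for both the leaderboard and any later analysis of failure modes.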

Project Variations

  • Focus only on arXiv papers in one ML subfield, such as vision or NLP, so you can compare reproducibility across a tight domain.
  • Replace the leaderboard with a failure taxonomy that labels issues as dependency, data, code, or metric mismatch.
  • Compare two sandbox styles, such as Colab and Docker, to see which one reproduces paper claims more faithfully.
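
For the failure taxonomy variation, a first pass can label failures from log text with simple pattern rules. The patterns and label names below are illustrative assumptions, and a rule-based labeler will misclassify some cases (a missing file can be a code bug, not a data problem), so plan to hand-check a sample of its labels.

```python
import re

# Illustrative patterns only; extend them from the logs you actually collect.
# Order matters: dependency errors also contain a traceback, so check them first.
RULES = [
    ("dependency", re.compile(r"ModuleNotFoundError|ImportError|No matching distribution")),
    ("data", re.compile(r"FileNotFoundError|HTTP Error 404|No such file or directory")),
    ("code", re.compile(r"Traceback \(most recent call last\)")),
]


def classify_failure(log_text, reported=None, rerun=None, tolerance=0.05):
    """Label a rerun as dependency, data, code, metric_mismatch, or none."""
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    # No error in the log: the run finished, so compare the numbers instead.
    if reported is not None and rerun is not None:
        if abs(rerun - reported) > tolerance * abs(reported):
            return "metric_mismatch"
    return "none"


if __name__ == "__main__":
    # Made-up log excerpt for demonstration.
    log = "Traceback (most recent call last):\nModuleNotFoundError: No module named 'foo'"
    print(classify_failure(log))
```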

Learn More

  • arXiv API documentation: Search the arXiv help pages for metadata and source access details.
  • GitHub REST API documentation: Find repository and commit metadata for linked research code.
  • Google Colab documentation: Learn the limits and workflow of the free notebook runtime.
  • Machine Learning Reproducibility Checklist papers: Search peer-reviewed articles and conference proceedings for reproducibility checklists and scoring methods.
  • NIH PubMed: Search review articles on reproducibility, evaluation bias, and research software practices.
  • MIT OpenCourseWare: Search computer science courses on software engineering, systems design, and machine learning workflows.
