RL Baseline Reproducibility in Robot Manipulation

ISEF Category: Robotics and Intelligent Machines

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

Two models can look tied on paper and still behave very differently when you rerun them. In robot learning, that gap can waste weeks and hide weak results. Your project asks a simple but powerful question: which method stays reliable when the random seed changes?

What Is It?

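This project studies reproducibility in robot manipulation, where a robot learns tasks like picking up, pushing, or opening objects in a simulator. You will compare PPO, SAC, and Diffusion Policy, then measure how much their results change when you keep the setup the same but change the random seed. PPO and SAC are reinforcement learning algorithms that learn by trial and error, while Diffusion Policy is an imitation learning method that learns from demonstrations. A random seed is the number that initializes the algorithm's random number generator: the same seed replays the same sequence of "dice rolls," and a different seed produces a different sequence.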

Think of it like testing three students on the same quiz, but letting each student retake it with different question order, different starting hints, and different practice examples. If one student always scores about the same, you trust the score more. If another student swings from great to terrible, you know the average score alone does not tell the whole story.
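To make the seed idea concrete, here is a minimal Python sketch using only the standard library's `random` module. The function name `sample_episode_noise` is hypothetical; it stands in for any random choice a training run makes, such as initial network weights or exploration noise.

```python
import random


def sample_episode_noise(seed: int, n: int = 5) -> list[float]:
    """Draw n random numbers from a generator initialized with `seed`."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [rng.random() for _ in range(n)]


run_a = sample_episode_noise(seed=0)
run_b = sample_episode_noise(seed=0)
run_c = sample_episode_noise(seed=1)

print(run_a == run_b)  # True: same seed, same sequence of "dice rolls"
print(run_a == run_c)  # False: a different seed gives a different sequence
```

In a full training script you would typically also seed NumPy (`np.random.seed`) and PyTorch (`torch.manual_seed`), and pass the seed to the environment when you reset it. Whether MetaWorld's `reset` accepts a `seed` argument depends on the version you install, so treat that detail as an assumption to check. Some GPU operations are also not bit-for-bit deterministic even with a fixed seed.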

Why This Is a Good Topic

This is a strong science fair topic because it is measurable, comparison-based, and tied to a real problem in AI research. Training runs in robot learning can cost a lot of compute, and weak evaluation can make a method look better than it is. You can learn how to design fair experiments, track variance, and report uncertainty in a way that other teams can reuse.

Research Questions

How does the choice of random seed affect final success rate for PPO on MetaWorld tasks?
How does the choice of random seed affect final success rate for SAC on MetaWorld tasks?
How does the choice of random seed affect final success rate for Diffusion Policy on MetaWorld tasks?
How does the size of a fixed compute budget affect the variance of training outcomes for PPO, SAC, and Diffusion Policy?
To what extent does task difficulty change the spread of scores across repeated runs?
Which summary statistic (mean, median, or best-of-n) gives the fairest comparison under limited compute?
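One way to start answering these questions is to summarize each method's final success rates across seeds with several statistics at once. The sketch below uses Python's `statistics` module; the function name `summarize_seeds` and the example numbers are made up for illustration and are not real results.

```python
import statistics


def summarize_seeds(success_rates: list[float]) -> dict[str, float]:
    """Summarize final success rates (0 to 1) from repeated seeds of one method on one task."""
    if len(success_rates) < 2:
        raise ValueError("need at least two seeds to measure spread")
    return {
        "n_seeds": len(success_rates),
        "mean": statistics.fmean(success_rates),
        "median": statistics.median(success_rates),
        "std": statistics.stdev(success_rates),  # sample standard deviation
        "min": min(success_rates),
        "best_of_n": max(success_rates),
    }


# Placeholder values for illustration only, not real experimental results.
example_rates = [0.62, 0.70, 0.15, 0.66, 0.68]
for name, value in summarize_seeds(example_rates).items():
    print(name, round(value, 3))
```

With these placeholder numbers, the single bad seed (0.15) pulls the mean (about 0.56) well below the median (0.66), while best-of-n ignores it entirely. That gap is exactly what the last research question asks about.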

Basic Materials

Laptop or desktop with a modern GPU or access to a shared compute cluster.
Python 3.10 or later.
MetaWorld benchmark environment.
PyTorch.
A code editor such as VS Code.
Spreadsheet software or Google Sheets for tracking runs.
Digital notebook for experiment logs.
External storage for saving checkpoints and results.
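If you track runs in a spreadsheet, it helps to have your scripts write one row per finished run to a CSV file that Google Sheets or Excel can open. This is a minimal sketch; the column names in `FIELDS` and the helper `append_run` are hypothetical choices, not a standard format.

```python
import csv
from pathlib import Path

# Hypothetical column layout; adapt it to whatever your project actually records.
FIELDS = ["method", "task", "train_seed", "env_steps", "final_success_rate", "notes"]


def append_run(path: Path, row: dict) -> None:
    """Append one finished run to a CSV file, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)  # keys not in FIELDS raise ValueError
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

You might call `append_run(Path("runs.csv"), {...})` at the end of each training script. Because `csv.DictWriter` raises a `ValueError` when a row contains a key that is not in `FIELDS`, typos in column names are caught early instead of silently creating messy data.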

Advanced Materials

Workstation or university compute server with an NVIDIA GPU.
Linux environment for stable package management.
Docker or Conda environment for reproducible installs.
Git for version control.
PyTorch with CUDA support.
MetaWorld benchmark suite.
NumPy and pandas for analysis.
SciPy for statistical testing.
Matplotlib or Seaborn for plots.
Weights & Biases or TensorBoard for training logs.
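Because reproducibility also depends on software versions, it can help to save a snapshot of the environment next to every run, even when you already use Docker or Conda. This sketch uses only the standard library (`importlib.metadata` and `platform`); the function name is hypothetical, and the package names passed in, especially `metaworld`, are assumptions about how your packages are installed, so check them against `pip list`.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Record the Python version, OS, and installed versions of the given packages."""
    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            snapshot["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot["packages"][name] = "not installed"
    return snapshot


# Package names are assumptions; check them with `pip list` in your environment.
print(json.dumps(environment_snapshot(["torch", "numpy", "pandas", "metaworld"]), indent=2))
```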

Software & Tools

Python: Runs training, logging, and analysis scripts for the benchmark comparisons.
PyTorch: Used to implement and train the PPO, SAC, and Diffusion Policy models.
MetaWorld: Provides the manipulation tasks used for the reproducibility audit.
pandas: Organizes run-level metrics, seeds, and summary tables.
Matplotlib: Plots variance, confidence intervals, and comparison charts.
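As an example of how pandas can turn run-level logs into a summary table, the sketch below groups runs by method and task and reports spread, not just the average. The column names, the task label `task_A`, and all success-rate values are placeholders, not real results.

```python
import pandas as pd

# One row per finished run. All values are placeholders, not real results.
runs = pd.DataFrame({
    "method": ["PPO"] * 3 + ["SAC"] * 3,
    "task": ["task_A"] * 6,
    "seed": [0, 1, 2, 0, 1, 2],
    "success_rate": [0.40, 0.55, 0.10, 0.80, 0.78, 0.82],
})

# Spread statistics per (method, task) pair, not just the average.
summary = (
    runs.groupby(["method", "task"])["success_rate"]
    .agg(["count", "mean", "median", "std", "min", "max"])
    .reset_index()
)
print(summary.round(3))
```

Keeping one row per run, rather than storing only averages, lets you recompute any statistic later, including ones you did not plan for at the start.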

Experiment Steps

Define the exact MetaWorld tasks you will compare and fix the evaluation metric for every run.
Choose the one experimental variable you will change first, such as the random seed, while holding compute and hyperparameters constant.
Set up a logging plan that records both training curves and final test performance for every baseline.
Plan a fairness check that uses the same budget, the same task suite, and the same stopping rule for all methods.
Build an analysis plan that reports spread, confidence intervals, and effect sizes, not just averages.
Decide how you will turn your results into a reusable, low-cost protocol that another student or team can follow.
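The steps above can be sketched as a run grid that fixes the task suite, the seeds, and the budget before any training starts. Everything here is hypothetical: `RunConfig`, `build_run_grid`, the task labels, and the choice of environment steps as the budget unit. Because Diffusion Policy trains on demonstrations rather than by interacting with the environment, you may need a different budget unit, such as gradient steps or GPU-hours, to compare it fairly.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    method: str
    task: str
    train_seed: int
    eval_seeds: tuple[int, ...]
    env_steps: int  # identical budget for every method (assumed budget unit)


def build_run_grid(methods, tasks, n_train_seeds, n_eval_episodes, env_steps, eval_seed_offset=10_000):
    """Plan every (method, task, seed) run up front, with one shared budget."""
    # Evaluation seeds come from a separate range, so they never collide with training
    # seeds as long as n_train_seeds stays below eval_seed_offset.
    eval_seeds = tuple(eval_seed_offset + i for i in range(n_eval_episodes))
    return [
        RunConfig(method, task, seed, eval_seeds, env_steps)
        for method, task, seed in product(methods, tasks, range(n_train_seeds))
    ]


grid = build_run_grid(
    methods=["PPO", "SAC", "DiffusionPolicy"],
    tasks=["task_A", "task_B"],
    n_train_seeds=5,
    n_eval_episodes=50,
    env_steps=1_000_000,
)
print(len(grid))  # 3 methods x 2 tasks x 5 seeds = 30 runs
```

Drawing evaluation seeds from their own range keeps training randomness and evaluation randomness separable, so you can later tell which one caused a given spread in scores.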

Common Pitfalls

Changing a method's hyperparameters between seeds, or tuning one method much more carefully than the others, which turns a reproducibility audit into a loose model comparison.
Reporting only the best run, which hides whether one method is stable or just lucky.
Mixing training seeds and evaluation seeds, which makes the source of variance impossible to trace.
Using different compute budgets for different baselines, which can make a method look unfairly weak (if it got less compute) or unfairly strong (if it got more).
Skipping task-by-task analysis, which can hide that one method works only on easy MetaWorld tasks.
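Several of these pitfalls can be caught automatically before analysis. The sketch below checks a list of run records for unequal budgets, mismatched seed sets, overlapping training and evaluation seeds, and hyperparameters that changed between runs of the same method. The record fields (`method`, `task`, `train_seed`, `eval_seeds`, `env_steps`, `hparams`) are a hypothetical log format. PPO and SAC naturally use different hyperparameters, so the check only compares each method against its own earlier runs.

```python
from collections import defaultdict


def audit_runs(runs: list[dict]) -> list[str]:
    """Return a list of fairness problems found in the run records (empty if none)."""
    problems = []

    # Same compute budget for every run.
    budgets = {r["env_steps"] for r in runs}
    if len(budgets) > 1:
        problems.append(f"unequal compute budgets: {sorted(budgets)}")

    # Every method should use the same training seeds on a given task.
    seeds = defaultdict(lambda: defaultdict(set))  # task -> method -> set of seeds
    for r in runs:
        seeds[r["task"]][r["method"]].add(r["train_seed"])
    for task, by_method in sorted(seeds.items()):
        if len({frozenset(s) for s in by_method.values()}) > 1:
            problems.append(f"{task}: methods used different seed sets")

    # Training and evaluation seeds must not overlap.
    train = {r["train_seed"] for r in runs}
    evaluation = {s for r in runs for s in r["eval_seeds"]}
    if train & evaluation:
        problems.append(f"training and evaluation seeds overlap: {sorted(train & evaluation)}")

    # A method's hyperparameters should not change between its own runs.
    hparams = defaultdict(set)
    for r in runs:
        hparams[r["method"]].add(tuple(sorted(r["hparams"].items())))
    for method, variants in sorted(hparams.items()):
        if len(variants) > 1:
            problems.append(f"{method}: hyperparameters changed between runs")

    return problems
```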

What Makes This Competitive

A strong version of this project goes past a simple leaderboard. You would predefine your metrics, separate training variance from evaluation variance, and report uncertainty with confidence intervals or bootstrap tests. You could also compare how different summary rules change the ranking of methods under the same budget. That kind of careful analysis shows you understand scientific measurement, not just model training.
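A percentile bootstrap is one common way to put confidence intervals on a mean success rate, or on the difference between two methods. The sketch below resamples seeds with replacement using only the standard library; the function names are hypothetical. With only a handful of seeds, bootstrap intervals are rough approximations, which is another reason to run more seeds when compute allows.

```python
import random
import statistics


def _percentile_interval(samples, confidence):
    """Return the lower and upper percentile bounds of a list of bootstrap estimates."""
    ordered = sorted(samples)
    lo_idx = int((1 - confidence) / 2 * len(ordered))
    hi_idx = int((1 + confidence) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


def bootstrap_mean_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean success rate across seeds."""
    rng = random.Random(seed)  # fixed seed so the analysis itself is reproducible
    means = [statistics.fmean(rng.choices(values, k=len(values))) for _ in range(n_resamples)]
    return (statistics.fmean(values), *_percentile_interval(means, confidence))


def bootstrap_diff_ci(a, b, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each method's seeds independently."""
    rng = random.Random(seed)
    diffs = [
        statistics.fmean(rng.choices(a, k=len(a))) - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    ]
    return (statistics.fmean(a) - statistics.fmean(b), *_percentile_interval(diffs, confidence))
```

Each function returns the point estimate followed by the lower and upper bounds. If the interval for a difference excludes zero, that is evidence (not proof) that one method's mean success rate really is higher under your budget.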

Project Variations

Compare reproducibility across easy, medium, and hard MetaWorld tasks to see whether instability grows with task complexity.
Test whether a smaller, fixed compute budget changes which baseline looks strongest across repeated seeds.
Analyze whether confidence intervals or median performance give a fairer ranking than raw mean success rate.
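To see how a summary rule can change a ranking, the sketch below ranks two methods by mean, median, and best-of-n. The method names `method_X` and `method_Y` and their numbers are invented to illustrate the effect and are not real results.

```python
import statistics

# Summary rules to compare; each maps a list of per-seed success rates to one score.
RULES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "best_of_n": max,
}


def rank_methods(results: dict[str, list[float]], rule: str) -> list[str]:
    """Return method names sorted from best to worst under the chosen summary rule."""
    score = RULES[rule]
    return sorted(results, key=lambda method: score(results[method]), reverse=True)


# Placeholder numbers chosen to show how rankings can flip; not real results.
results = {
    "method_X": [0.90, 0.05, 0.10, 0.95, 0.08],  # unstable across seeds
    "method_Y": [0.45, 0.50, 0.40, 0.55, 0.48],  # steady across seeds
}
for rule in RULES:
    print(rule, rank_methods(results, rule))
# mean ['method_Y', 'method_X']
# median ['method_Y', 'method_X']
# best_of_n ['method_X', 'method_Y']
```

Here `method_X` is unstable: it wins under best-of-n but loses under mean and median, which is why reporting only the best run can be misleading.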

Learn More

MetaWorld project page: Search for the official benchmark documentation and task descriptions to understand the manipulation suite.
OpenAI Spinning Up in Deep RL: A free educational resource from OpenAI that explains PPO and SAC; search for it on OpenAI's website.
PyTorch tutorials: Free official tutorials that help you build, train, and log neural network experiments.
PubMed: Search for review articles on reproducibility in machine learning and robotic manipulation to frame your research question.
MIT OpenCourseWare: Search for reinforcement learning lecture notes and assignments that explain policy gradients and off-policy learning.
arXiv: Search for recent papers on Diffusion Policy, robot manipulation benchmarks, and reproducibility studies.
