Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators

Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar

TL;DR

SureSim provides finite-sample valid confidence intervals for real-world policy performance by augmenting limited real evaluations with large-scale simulations through a real2sim mapping $g$ and prediction-powered inference. By combining paired real/simulation evaluations with additional simulations via the UniformPPI and 2-stage PPI estimators, and using the Waudby-Smith and Ramdas non-asymptotic CI method, the approach yields reliable bounds on $\mu^* = \mathbb{E}_{X \sim \mathcal{D}_{\text{env}}}[Y(X)]$. In experiments with diffusion policies and the robot foundation model $\pi_0$, SureSim achieves 20-25% hardware evaluation savings under moderate to high real-simulation correlation, while controlling for type-I error. The work addresses the simulation-to-real gap in robotic manipulation, enabling scalable, trustworthy evaluation across diverse objects and initial conditions, though its benefits diminish when correlation is low or the gap is too large. Future directions include improved simulators, active sampling strategies, and larger, diverse evaluation datasets to broaden applicability.

Abstract

Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned $\pi_0$ on a joint distribution of objects and initial conditions, and find that our approach saves over 20-25% of hardware evaluation effort to achieve similar bounds on policy performance.
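The bias-rectification idea in the abstract can be illustrated with the basic prediction-powered mean estimate: average the simulation outcomes on the large unpaired set, then add the average real-minus-simulation difference measured on the paired set. The sketch below is a minimal illustration under stated assumptions, not the UniformPPI or 2-stage PPI estimators themselves; the function name `ppi_mean_estimate` and the toy data are hypothetical, and outcomes are assumed to be binary success indicators.

```python
def ppi_mean_estimate(real_paired, sim_paired, sim_extra):
    """Basic prediction-powered point estimate of mean real-world performance.

    real_paired: real outcomes on the n paired environments
    sim_paired:  simulation outcomes on the same n environments
    sim_extra:   simulation outcomes on N additional environments
    (All names here are illustrative assumptions.)
    """
    if len(real_paired) != len(sim_paired):
        raise ValueError("paired real and simulation outcomes must align")
    if not real_paired or not sim_extra:
        raise ValueError("need at least one paired and one extra sample")

    n = len(real_paired)
    # Rectifier: average real-minus-simulation gap on the paired trials.
    rectifier = sum(y - f for y, f in zip(real_paired, sim_paired)) / n
    # Large-scale simulation average on the additional environments.
    sim_mean = sum(sim_extra) / len(sim_extra)
    # The rectifier corrects the simulator's bias in expectation.
    return sim_mean + rectifier


# Toy data: the simulator is optimistic on one of the four paired trials.
real = [1, 0, 1, 0]
sim = [1, 1, 1, 0]
extra = [1, 1, 0, 1, 0, 1, 1, 1]
print(ppi_mean_estimate(real, sim, extra))
```

If the paired and additional environments are drawn from the same distribution, this estimate is unbiased for the real-world mean no matter how inaccurate the simulator is; the simulator's quality affects only the variance, which is consistent with the TL;DR's remark that benefits diminish when real-simulation correlation is low.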

Paper Structure

This paper contains 21 sections, 1 theorem, 5 equations, 19 figures, 2 tables.

Key Result

Theorem 1

SureSim and its variants return a finite-sample valid confidence interval $CI$ that satisfies the guarantee in eq:guarantee.
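To make "finite-sample valid" concrete, the sketch below builds an interval around a prediction-powered estimate using Hoeffding's inequality and a union bound. This is a deliberately simple stand-in, not the Waudby-Smith and Ramdas method named in the TL;DR. It assumes all outcomes lie in $[0, 1]$ and that paired and additional environments are i.i.d. draws from the same distribution; the function name `hoeffding_ppi_interval` is hypothetical.

```python
import math


def hoeffding_ppi_interval(real_paired, sim_paired, sim_extra, alpha=0.05):
    """Finite-sample interval for the real-world mean via Hoeffding bounds.

    Assumes every outcome lies in [0, 1]. Hoeffding is swapped in here for
    simplicity; it is not the CI method used by SureSim.
    """
    n, big_n = len(real_paired), len(sim_extra)
    if n == 0 or big_n == 0 or n != len(sim_paired):
        raise ValueError("need aligned paired data and extra simulations")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")

    # Point estimate: simulation mean plus the paired rectifier.
    rectifier = sum(y - f for y, f in zip(real_paired, sim_paired)) / n
    center = sum(sim_extra) / big_n + rectifier

    # Spend alpha/2 on each term (union bound); two-sided Hoeffding gives
    # half-width (b - a) * sqrt(log(4 / alpha) / (2 * m)) for range [a, b].
    log_term = math.log(4.0 / alpha)
    half_sim = 1.0 * math.sqrt(log_term / (2 * big_n))  # sim values in [0, 1]
    half_rect = 2.0 * math.sqrt(log_term / (2 * n))     # gaps in [-1, 1]
    half = half_sim + half_rect

    # The true mean lies in [0, 1], so clipping preserves coverage.
    return max(0.0, center - half), min(1.0, center + half)


# Synthetic data: 400 paired trials and 4000 additional simulations.
lo, hi = hoeffding_ppi_interval(
    [1, 0, 1, 0] * 100, [1, 1, 1, 0] * 100, [1, 1, 0, 1, 0, 1, 1, 1] * 500
)
print(lo, hi)
```

Because Hoeffding's bound depends only on the range of the data and ignores variance, this interval is wider than a Hoeffding interval on the real trials alone and would not reproduce the reported hardware savings. It shows only the coverage mechanics; variance-adaptive methods are what allow a well-correlated simulator to tighten the bound.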

Figures (19)

  • Figure 1: Our goal is to evaluate a policy by computing bounds on its mean real-world performance on a diverse environment distribution $\mathcal{D}_{\text{env}}$. We present a framework that augments real-world evaluations with simulation evaluations to provide stronger inferences on real-world policy performance that could otherwise only be obtained by scaling up real-world evaluations.
  • Figure 2: Objects used for real-world evaluations.
  • Figure 3: Initial conditions for evaluation experiments.
  • Figure 4: Evaluating Diffusion Policy with $n=60$ paired trials and up to $700$ additional simulations.
  • Figure 5: Average number of hardware trials saved compared to Classical, computed over $100$ random draws of data. Error bars indicate standard error of the mean savings.
  • ...and 14 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof