Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar
TL;DR
SureSim provides finite-sample-valid confidence intervals for real-world policy performance by augmenting limited real evaluations with large-scale simulations through a real2sim mapping $g$ and prediction-powered inference (PPI). By combining paired real/simulation evaluations with additional simulations via the UniformPPI and 2-stage PPI estimators, and using the Waudby-Smith and Ramdas non-asymptotic CI method, the approach yields reliable bounds on $\mu^* = \mathbb{E}_{X \sim \mathcal{D}_{\text{env}}}[Y(X)]$. In experiments with diffusion policies and the robot foundation model $\pi_0$, SureSim achieves 20-25% hardware evaluation savings under moderate to high real-simulation correlation while controlling type-I error. The work addresses the simulation-to-real gap in robotic manipulation, enabling scalable, trustworthy evaluation across diverse objects and initial conditions, though its benefits diminish when the correlation is low or the gap is too large. Future directions include improved simulators, active sampling strategies, and larger, more diverse evaluation datasets to broaden applicability.
Abstract
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and initial conditions, and find that our approach saves over 20-25% of hardware evaluation effort to achieve similar bounds on policy performance.
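To make the bias-rectification idea concrete, here is a minimal sketch of the standard prediction-powered point estimate of a mean, written under our own assumptions rather than taken from the SureSim implementation. The function name `ppi_mean_estimate`, its arguments, and the toy success/failure outcomes are hypothetical and chosen only for illustration. The sketch covers the point estimate only; it does not construct the non-asymptotic confidence interval, and it does not reproduce the UniformPPI or 2-stage PPI estimators.

```python
from statistics import fmean
from typing import Sequence


def ppi_mean_estimate(
    real_paired: Sequence[float],
    sim_paired: Sequence[float],
    sim_extra: Sequence[float],
) -> float:
    """Standard prediction-powered estimate of the mean real-world outcome.

    real_paired[i] and sim_paired[i] are the real and simulated outcomes for
    the same environment; sim_extra holds simulation-only outcomes on
    additional environments. All names here are illustrative assumptions.
    """
    if len(real_paired) != len(sim_paired):
        raise ValueError("paired real and simulation outcomes must align")
    if len(real_paired) == 0 or len(sim_extra) == 0:
        raise ValueError("need at least one paired and one extra evaluation")

    # Rectifier: average real-minus-sim discrepancy on the paired environments.
    rectifier = fmean([r - s for r, s in zip(real_paired, sim_paired)])

    # Large-scale simulation mean, shifted by the estimated simulator bias.
    return fmean(sim_extra) + rectifier


# Toy data (made up): the simulator is optimistic on one of four paired trials.
estimate = ppi_mean_estimate(
    real_paired=[1, 0, 1, 1],
    sim_paired=[1, 1, 1, 1],
    sim_extra=[1, 1, 1, 0, 1, 1, 1, 1],
)
print(estimate)  # 0.625
```

In this toy case the simulation-only mean is 0.875 and the rectifier is -0.25, so the estimate moves down to 0.625. The more closely the paired simulation outcomes track the real ones, the less the rectifier varies, which is the intuition behind the savings reported under moderate to high real-simulation correlation.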
