Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

David Snyder, Apurva Badithela, Nikolai Matni, George Pappas, Anirudha Majumdar, Masha Itkina, Haruki Nishimura

Abstract

Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
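The early-stopping idea described in the abstract can be illustrated with a generic testing-by-betting procedure. The sketch below is not the paper's procedure: it assumes paired rollouts with scores in [0, 1], a one-sided null hypothesis that policy A is no better on average than policy B, and a fixed bet size. The function name `sequential_betting_test` and its parameters are hypothetical.

```python
def sequential_betting_test(scores_a, scores_b, alpha=0.05, bet=0.5):
    """Generic anytime-valid test of H0: E[a] <= E[b] for scores in [0, 1].

    Illustrative assumption, not the paper's algorithm. The wealth process
    starts at 1 and is multiplied by 1 + bet * (a - b) after each paired
    rollout. Because a - b lies in [-1, 1] and 0 <= bet < 1, every factor is
    positive, and under H0 its expectation is at most 1, so the wealth is a
    nonnegative supermartingale. Rejecting when wealth >= 1 / alpha keeps the
    Type-1 error at or below alpha regardless of when the evaluator stops.

    Returns (rejected, number_of_paired_rollouts_used).
    """
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must lie in [0, 1) to keep the wealth positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")

    wealth = 1.0
    used = 0
    for used, (a, b) in enumerate(zip(scores_a, scores_b), start=1):
        if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
            raise ValueError("scores must lie in [0, 1]")
        wealth *= 1.0 + bet * (a - b)
        if wealth >= 1.0 / alpha:
            return True, used  # enough evidence: stop early
    return False, used  # evaluation budget exhausted without a decision


if __name__ == "__main__":
    # Partial-credit scores: A reaches 75% task progress, B reaches 25%.
    progress_a = [0.75] * 50
    progress_b = [0.25] * 50
    print(sequential_betting_test(progress_a, progress_b))

    # Binary success on the same rollouts: neither policy completes the task,
    # so the binary metric carries no evidence and the test never rejects.
    success_a = [0.0] * 50
    success_b = [0.0] * 50
    print(sequential_betting_test(success_a, success_b))
```

The second call shows, in a deliberately extreme toy case, why a partial-credit score can separate policies that a binary success flag cannot. A practical procedure would choose the bet adaptively rather than fixing it.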

Paper Structure (46 sections, 2 theorems, 31 equations, 7 figures, 5 tables, 2 algorithms)

Key Result

Lemma 1

Consider the stochastic process $\{X_n\}$ defined in \ref{the_evidence_integrator}, setting (w.l.o.g.) $X_0 = 1$. Then the expectation of $\{X_n\}$ is contracting in time with respect to the current value, for all $h \in S^{-}$ and any $\xi_n \in [0, 1]$. That is:
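The displayed inequality is missing from this extract. Judging from the lemma's name (Null Stability (NSM) Property) and from Definition 2, a plausible reading is the standard supermartingale condition below. The filtration $\mathcal{F}_n$ and the subscript on the expectation are notational assumptions, not taken from the source.

```latex
\mathbb{E}_{h}\!\left[ X_{n+1} \mid \mathcal{F}_n \right] \;\le\; X_n
\qquad \text{for all } h \in S^{-},\ \xi_n \in [0, 1],\ n \ge 0 .
```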

Figures (7)

  • Figure 1: The evaluation context of the N-SCORE procedure. (Left) We consider the general problem of policy comparison, which arises out of counterfactual design decisions in the policy synthesis process. (Middle) Evaluation on hardware or in high-fidelity simulation is the gold standard to assess the effect of such changes, but evaluation data is costly to collect. (Right) N-SCORE is a sequential evaluation procedure that is statistically rigorous, sample-efficient, and generalizes to rich, diverse measures of robot performance.
  • Figure 2: Policy performance comparisons on crowd-sourced real-world evaluations on the DROID [khazatsky2024droid] setup from RoboArena [atreya2025roboarena]. The violin plots represent empirical distributions of observed results. Policies with different letters are statistically distinguishable by the method. Policies are compared at a global error bound of $\alpha=0.05$ with a Bonferroni correction.
  • Figure A.1: Number of trials required to separate RL policies on InvertedPendulum-v4. This is one of the starkest disparities between N-SCORE and WSR: the latter requires nearly three times the evaluation effort to reach a decision.
  • Figure A.2: Violin plots and the number of samples required for N-SCORE and WSR on multi-policy comparison of RL policies on MuJoCo benchmarks. Policies with different letters are statistically distinguishable by the method. Policies are compared at a global error bound of $\alpha = 0.05$ with a Bonferroni correction. In all cases, N-SCORE reaches the same comparison conclusions as WSR with fewer samples, demonstrating its broadly improved efficiency. These results also serve as an alternative visualization of the time-to-decision results in \ref{table_of_rl_data_all}.
  • Figure A.3: Continuous task progress scores enable faster policy comparison on RoboArena policy evaluation data. Violin plots and time-to-decision for multi-policy comparison from RoboArena evaluations. Left: policy comparison under continuous scores with N-SCORE. Right: policy comparison under Bernoulli scores with N-SCORE, WSR, and STEP. In the Bernoulli comparison setting, none of the methods can distinguish all the policies. Even though STEP is maximally efficient on Bernoulli comparisons, N-SCORE on continuous progress requires fewer total trials to distinguish policies. We emphasize that the partial credit and binary metrics arise from the same rollouts of each policy; the partial credit is thus precisely a more informative 'representation' of the rollout for evaluation purposes. The reduced time-to-decision and increased power (successfully separating all of the policies) highlight the fundamental advantage of fine-grained task progress scores over sparse binary success rates for efficient policy comparison. As can be observed in \ref{fig:roboarena} in the main text, N-SCORE also significantly reduces the time-to-decision with respect to WSR: the former requires approximately 1420 trials, whereas the latter needs an additional 450 and still fails to distinguish $\pi_0$ from PG-Diff.
  • ...and 2 more figures
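Several captions above compare policies at a global error bound of $\alpha = 0.05$ with a Bonferroni correction. A minimal sketch of that bookkeeping follows, assuming the correction is applied over all unordered policy pairs (the source does not state the exact family of comparisons). The helper names `bonferroni_levels` and `rejection_threshold` and the policy labels are hypothetical.

```python
from itertools import combinations


def bonferroni_levels(policies, global_alpha=0.05):
    """Split a global error budget evenly across all pairwise comparisons.

    Assumption for illustration: every unordered pair of policies is tested
    once, so k policies give k * (k - 1) / 2 comparisons.
    """
    pairs = list(combinations(policies, 2))
    if not pairs:
        raise ValueError("need at least two policies to compare")
    per_test_alpha = global_alpha / len(pairs)
    return {pair: per_test_alpha for pair in pairs}


def rejection_threshold(per_test_alpha):
    """Evidence level a supermartingale-based test must reach to reject."""
    return 1.0 / per_test_alpha


if __name__ == "__main__":
    levels = bonferroni_levels(["policy_1", "policy_2", "policy_3", "policy_4"])
    for pair, level in levels.items():
        print(pair, level, rejection_threshold(level))
```

With four policies there are six pairs, so each comparison runs at level 0.05 / 6 and a sequential test of the kind described in the abstract would need to accumulate evidence of roughly 120 before rejecting. The union bound then keeps the probability of any false separation at or below 0.05.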

Theorems & Definitions (7)

  • Definition 1: Generalized Progress Metric
  • Lemma 1: Null Stability (NSM) Property
  • Theorem 1: Type-1 Error Control of \ref{the_nsm_algorithm}
  • proof
  • Remark 1: Efficient Optimization of $\xi_n$
  • Definition 2: Nonnegative Supermartingale (NSM)
  • Remark 2: Linearity of $\underline{P}_{ij}$
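The items above pair a nonnegative supermartingale (Definition 2) with a Type-1 error guarantee (Theorem 1). The standard tool connecting the two is Ville's inequality, stated below for reference. Whether the paper's proof uses exactly this form is an assumption, since the proof is not included in this extract.

```latex
\Pr\!\left( \exists\, n \ge 0 : X_n \ge \tfrac{1}{\alpha} \right) \;\le\; \alpha
\qquad \text{for any nonnegative supermartingale } \{X_n\} \text{ with } X_0 = 1,\ \alpha \in (0, 1).
```

Under this inequality, a test that rejects the first time $X_n$ reaches $1/\alpha$ has false-rejection probability at most $\alpha$ whenever $\{X_n\}$ is a nonnegative supermartingale under the null, no matter when the evaluator chooses to stop.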