How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation

Joseph A. Vincent; Haruki Nishimura; Masha Itkina; Paarth Shah; Mac Schwager; Thomas Kollar

TL;DR

This work presents a framework that provides a tight lower-bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts, and provides a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task.

Abstract

With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, it remains a challenge to accurately gauge the performance of such policies. This is exacerbated by distribution shifts causing unpredictable changes in performance during deployment. To rigorously evaluate behavior cloning policies, we present a framework that provides a tight lower-bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts. Notably, by applying the standard stochastic ordering to robot performance distributions, we provide a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task. We build upon established statistical results to ensure that the bounds hold with a user-specified confidence level and tightness, and are constructed from as few policy rollouts as possible. In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware. Specifically, we (i) empirically validate the guarantees of the bounds in simulated manipulation settings, (ii) find the degree to which a learned policy deployed on hardware generalizes to new real-world environments, and (iii) rigorously compare two policies tested in out-of-distribution settings. Our experimental data, code, and implementation of confidence bounds are open-source.
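To make the idea of a lower confidence bound on success probability concrete, the sketch below computes the classical one-sided Clopper-Pearson bound from `k` successes in `n` rollouts. This is the baseline that the Figure 3 caption compares against, not the paper's own optimal bound. The function names and the bisection tolerance are assumptions made for illustration.

```python
from math import comb


def binom_tail_ge(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson_lower(k: int, n: int, alpha: float = 0.05, tol: float = 1e-10) -> float:
    """One-sided Clopper-Pearson lower bound on the success probability.

    Returns (up to `tol`) the largest p with P(X >= k | p) <= alpha, so the
    true success probability exceeds the bound with confidence 1 - alpha.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if k == 0:
        return 0.0  # zero successes give no information beyond p >= 0
    lo, hi = 0.0, 1.0
    # The tail probability is increasing in p, so bisection finds the crossing.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) > alpha:
            hi = mid
        else:
            lo = mid
    return lo  # return the conservative (smaller) endpoint
```

For example, 10 successes in 10 rollouts at 95% confidence give a lower bound of $0.05^{1/10} \approx 0.741$, which shows how loose such bounds are at small sample sizes.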

Paper Structure (16 sections, 18 equations, 8 figures, 1 table)

Introduction
Related Work
Formal Evaluation of Learned Policies
Statistical Evaluation of Learned Policies
Distributional Bounds
Bounds
Assumptions
Confidence Bounds - Binary Metric
Optimality Criteria
Optimal Binomial Bounds
Confidence Bounds - Continuous Metric
Experiments
Simulation Results
Generalization of a Single Policy
Comparing the Generalization of Two Policies
...and 1 more section

Figures (8)

Figure 1: Our approach to evaluating BC policies. First, policy rollouts are collected in the environment of interest. Second, each rollout results in either a binary or continuous performance measurement. Third, statistical tools are used to compute an upper confidence bound on the CDF of performance. Finally, the user interprets the confidence bound and chooses to deploy or retrain the policy.
Figure 2: Hypothetical distributions of a lower confidence bound $\underline{p}$ for an unknown probability of success $p$. High confidence levels give better chances that the $\underline{p}$ we obtain is lower than $p$ (green shaded region), and tighter bounds give better chances that $\underline{p}$ is close to $p$.
Figure 3: Variation of MES, $n$, and $\alpha$. Our method is always tighter than Clopper-Pearson, and appreciably so at small sample sizes.
Figure 4: Variation of $\epsilon^*$, $n$, and $\alpha$. Our method is always tighter than the DKW bound, and appreciably so at small sample sizes.
Figure 5: Simulation tasks: can, lift, square, tool hang, and transport (zhu2020robosuite).
...and 3 more figures
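The Figure 4 caption uses the DKW (Dvoretzky-Kiefer-Wolfowitz) bound as its baseline for continuous metrics. As a rough sketch of that baseline only, the code below builds a one-sided upper confidence bound on the CDF by shifting the empirical CDF up by $\epsilon = \sqrt{\ln(1/\alpha) / (2n)}$. The function names are hypothetical, and the restriction $\alpha \le 0.5$ is assumed so that the one-sided inequality with this constant applies.

```python
from bisect import bisect_right
from math import log, sqrt
from typing import Callable, Sequence


def dkw_epsilon(n: int, alpha: float) -> float:
    """Offset of the one-sided DKW bound for n samples at confidence 1 - alpha."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5] for the one-sided bound")
    return sqrt(log(1.0 / alpha) / (2.0 * n))


def dkw_upper_cdf(samples: Sequence[float], alpha: float = 0.05) -> Callable[[float], float]:
    """Return x -> min(1, F_n(x) + eps), an upper confidence bound on the CDF.

    With probability at least 1 - alpha the true CDF lies below this curve
    for every x simultaneously. Illustrative baseline, not the paper's method.
    """
    xs = sorted(samples)
    n = len(xs)
    eps = dkw_epsilon(n, alpha)

    def upper(x: float) -> float:
        ecdf = bisect_right(xs, x) / n  # fraction of samples <= x
        return min(1.0, ecdf + eps)

    return upper
```

With `n = 50` rollouts and `alpha = 0.05`, the offset is about 0.173, so the bound only becomes informative once the empirical CDF is well below 1.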

Theorems & Definitions (1)

Definition 1: Uniformly Most Accurate, Eq. 3.22 of lehmann_textbook
