Contrast Sets for Evaluating Language-Guided Robot Policies

Abrar Anwar, Rohan Gupta, Jesse Thomason

TL;DR

This work uses the relative performance change of different contrast set perturbations to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task.

Abstract

Robot evaluations in language-guided, real-world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between the experimenter effort needed to carry out an evaluation and the resulting estimated test performance, as well as the insights that can be drawn from performance on perturbed instances. We use the relative performance change of different contrast set perturbations to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small-scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real-world evaluations.

Paper Structure (20 sections, 8 figures, 1 algorithm)


Figures (8)

  • Figure 1: Overview. Left: In standard test set evaluation, a test set is sampled i.i.d. at random to cover the domain of possible language, scenes, and behaviors that a robot can execute. It can be expensive to reset the scene for each new test instance during experiments. Middle: In this work, we design contrast sets [gardner2020evaluating] for language-guided robot evaluation, comprising perturbation strategies based on the language, scene, and expected behavior of the robot. Right: The proposed contrast set evaluation allows experimenters to efficiently evaluate neighborhoods around original test instances.
  • Figure 2: Language-Table Rollouts. In the Language-Table simulator [langauge_table], we sample an evaluation set of 250 test instances that is evaluated sequentially. Each test instance is sampled from one of five task types, which manipulate blocks according to a task definition. The standard evaluation requires sampling instructions and scenes i.i.d. at random, which accumulates more effort for the experimenter. Contrast set evaluation allows experimenters to perturb sampled test instances by making minimal changes after each execution, leading to less work for the experimenter.
  • Figure 3: Left: A key insight offered by contrast set evaluations is probing the strengths and weaknesses of a learned policy. The mean success-weighted path length (SPL) achieved over the full test set can compare average policy performance, but here we observe additional robustness to instruction source and target switches and to source block starting position ($\Delta LB_1$, $\Delta SB_2$) but brittleness to direction word inversion ($\Delta LB_2$), providing insights for training and deployment. Right: Comparison of each evaluation strategy's absolute error in estimating the SPL of the entire test set, as a function of cumulative cost, measured as the distance blocks are moved during scene resets. The maximum cost of the standard evaluation is 281, at which point it reaches the horizontal error line at 0.0. We cap cost at 300, though additional perturbation instances are possible for some strategies. All perturbation strategies achieve better test set SPL estimates than a Limited Intervention baseline.
  • Figure 4: VLN-CE Robot Rollouts. The standard evaluation, which samples scenes i.i.d. at random, requires scenes to be rearranged drastically between test instances. Intuitively, contrast sets allow experimenters to cheaply perturb sampled test instances to find new ecologically valid samples to evaluate.
  • Figure 5: Left: Contrast set evaluation probes the strengths of a trained VLN-CE model deployed on a physical robot. We observe that the policy is robust to changes to the final goal instruction ($\Delta LB$) and to physical changes to the goal itself ($\Delta SB$), as shown by the 13% higher performance over the full contrast set. Right: Average cumulative progress-to-goal error (and cumulative std. dev.) vs. cumulative cost. The contrast set evaluation quickly reaches a nearly accurate estimate of the final test set progress to goal, showcasing the potential to reduce experimenter costs dramatically by exploring neighborhoods of contrast around test instances.
  • ...and 3 more figures
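
The right-hand panels of Figures 3 and 5 plot estimation error against cumulative experimenter cost. The sketch below shows one way such a curve could be computed. It is a minimal illustration, not the authors' implementation. The function name `estimation_error_curve`, the per-instance `scores` (for example, per-episode SPL or progress to goal), the per-instance reset `costs`, and the optional `target` are all assumptions introduced here.

```python
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def estimation_error_curve(
    scores: Sequence[float],
    costs: Sequence[float],
    target: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Return (cumulative_cost, absolute_error) after each evaluated instance.

    scores: hypothetical per-instance metric values, in evaluation order.
    costs:  hypothetical per-instance reset costs (e.g. distance blocks moved).
    target: reference value being estimated; defaults to the mean of `scores`,
            which is an assumption made for this sketch.
    """
    if len(scores) != len(costs) or len(scores) == 0:
        raise ValueError("scores and costs must be non-empty and equal in length")
    if target is None:
        target = sum(scores) / len(scores)

    cumulative_costs = accumulate(costs)
    cumulative_scores = accumulate(scores)
    curve = []
    for n, (cost, total) in enumerate(zip(cumulative_costs, cumulative_scores), start=1):
        running_mean = total / n  # estimate of the metric after n instances
        curve.append((cost, abs(running_mean - target)))
    return curve


if __name__ == "__main__":
    # Toy values chosen only for illustration; they are not results from the paper.
    for cost, err in estimation_error_curve([1.0, 0.0, 0.5, 0.5], [2.0, 1.0, 0.5, 0.5]):
        print(f"cost={cost:.1f} error={err:.2f}")
```

Under these assumptions, an evaluation order with cheaper resets reaches a given error level at a lower cumulative cost, which is the comparison the captions describe.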