Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection
Abrar Anwar, Rohan Gupta, Zain Merchant, Sayan Ghosh, Willie Neiswanger, Jesse Thomason
TL;DR
This work tackles the high cost of evaluating many robot policies across numerous tasks by casting evaluation as a population-parameter estimation problem and solving it with active, cost-aware testing. A surrogate model, conditioned on task and policy embeddings (including language-based priors for tasks), predicts the distribution parameters $\theta_{ij}$ for each policy-task pair. An expected information gain criterion based on BALD (Bayesian Active Learning by Disagreement) then selects informative experiments, with a penalty for the cost of switching between tasks. On offline datasets (HAMSTER, OpenVLA, MetaWorld), cost-aware active sampling improves mean-parameter estimation, and the experiments show when task representations, notably language-informed embeddings, make evaluation more efficient. The approach offers a scalable path to rigorous multi-task robot evaluation as the space of policies and tasks grows, with practical implications for benchmarking and model selection under real-world experimental constraints.
Abstract
Evaluating learned robot control policies to determine their physical task-level capabilities costs experimenter time and effort. The growing number of policies and tasks exacerbates this issue. It is impractical to test every policy on every task multiple times; each trial requires a manual environment reset, and each task change involves re-arranging objects or even changing robots. Naively selecting a random subset of tasks and policies to evaluate is a high-cost solution with unreliable, incomplete results. In this work, we formulate robot evaluation as an active testing problem. We propose to model the distribution of robot performance across all tasks and policies as we sequentially execute experiments. Tasks often share similarities that can reveal potential relationships in policy behavior, and we show that natural language is a useful prior in modeling these relationships between tasks. We then leverage this formulation to reduce the experimenter effort by using a cost-aware expected information gain heuristic to efficiently select informative trials. Our framework accommodates both continuous and discrete performance outcomes. We conduct experiments on existing evaluation data from real robots and simulations. By prioritizing informative trials, our framework reduces the cost of calculating evaluation metrics for robot policies across many tasks.
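To make the cost-aware selection step concrete, the following is a minimal sketch for the discrete (success/failure) case only. It is not the authors' implementation. It assumes the surrogate is approximated by K sampled success probabilities per policy-task pair, and that the switching cost is a fixed penalty subtracted whenever the chosen task differs from the current one. The function names, the array layout, and the `switch_cost` parameter are all hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_scores(prob_samples):
    """BALD mutual information for each policy-task pair.

    prob_samples has shape (K, n_policies, n_tasks): K sampled success
    probabilities per pair (assumed here to come from a surrogate ensemble).
    Score = entropy of the mean prediction - mean entropy of the samples.
    """
    mean_p = prob_samples.mean(axis=0)
    return binary_entropy(mean_p) - binary_entropy(prob_samples).mean(axis=0)

def select_experiment(prob_samples, current_task=None, switch_cost=0.1):
    """Pick the (policy, task) pair with the highest cost-adjusted score.

    Assumption: changing tasks costs a flat `switch_cost`; staying on
    `current_task` (or having no current task yet) costs nothing.
    """
    scores = bald_scores(prob_samples)
    penalty = np.zeros(scores.shape[1])
    if current_task is not None:
        penalty[:] = switch_cost
        penalty[current_task] = 0.0
    adjusted = scores - penalty[None, :]  # broadcast penalty over policies
    policy, task = np.unravel_index(np.argmax(adjusted), adjusted.shape)
    return int(policy), int(task)

# Two surrogate samples, two policies, two tasks (illustrative numbers only).
samples = np.array([
    [[0.5, 0.90], [0.2, 0.01]],
    [[0.5, 0.90], [0.8, 0.99]],
])
cheap_switch = select_experiment(samples, current_task=0, switch_cost=0.1)
costly_switch = select_experiment(samples, current_task=0, switch_cost=1.0)
```

In this toy setup, the samples disagree most about policy 1 on task 1, so a small switching cost still favors moving to that task. A large switching cost instead keeps the experimenter on task 0 and selects the most uncertain policy there.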
