Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection

Abrar Anwar, Rohan Gupta, Zain Merchant, Sayan Ghosh, Willie Neiswanger, Jesse Thomason

TL;DR

This work tackles the high cost of evaluating many robot policies across numerous tasks by casting evaluation as a population-parameter estimation problem and solving it with active, cost-aware testing. A surrogate model, conditioned on task and policy embeddings (including language-based priors for tasks), predicts distribution parameters $\theta_{ij}$ for each policy-task pair, while BALD-based expected information gain guides the selection of informative experiments under switching-cost penalties. Across offline datasets (HAMSTER, OpenVLA, MetaWorld), cost-aware active sampling improves mean-parameter estimation and demonstrates when task representations (notably language-informed embeddings) enhance efficiency. The approach offers a scalable pathway to rigorous, multi-task robot evaluation as the space of policies and tasks expands, with practical implications for benchmarking and model selection under real-world experimental constraints.

Abstract

Evaluating learned robot control policies to determine their physical task-level capabilities costs experimenter time and effort. The growing number of policies and tasks exacerbates this issue. It is impractical to test every policy on every task multiple times; each trial requires a manual environment reset, and each task change involves re-arranging objects or even changing robots. Naively selecting a random subset of tasks and policies to evaluate is a high-cost solution with unreliable, incomplete results. In this work, we formulate robot evaluation as an active testing problem. We propose to model the distribution of robot performance across all tasks and policies as we sequentially execute experiments. Tasks often share similarities that can reveal potential relationships in policy behavior, and we show that natural language is a useful prior in modeling these relationships between tasks. We then leverage this formulation to reduce the experimenter effort by using a cost-aware expected information gain heuristic to efficiently select informative trials. Our framework accommodates both continuous and discrete performance outcomes. We conduct experiments on existing evaluation data from real robots and simulations. By prioritizing informative trials, our framework reduces the cost of calculating evaluation metrics for robot policies across many tasks.
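The cost-aware selection idea in the abstract can be illustrated with a minimal sketch for binary (success/failure) outcomes. It assumes the surrogate model yields several sampled success probabilities per policy-task pair (for example from an ensemble), scores each pair with a BALD-style mutual information, and divides by a cost that is higher when the task changes. The function names, the `switch_cost` and `trial_cost` values, and the choice to divide information gain by cost are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def bernoulli_entropy(p):
    """Entropy (in nats) of Bernoulli(p), clipped for numerical stability."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def bald_eig(prob_samples):
    """BALD-style expected information gain per policy-task pair.

    prob_samples: array of shape (S, P, T) holding S sampled success
    probabilities for each of P policies and T tasks (hypothetical layout).
    Returns an array of shape (P, T).
    """
    mean_p = prob_samples.mean(axis=0)
    # Entropy of the averaged prediction minus the average per-sample entropy.
    return bernoulli_entropy(mean_p) - bernoulli_entropy(prob_samples).mean(axis=0)


def select_experiment(prob_samples, current_task, switch_cost=3.0, trial_cost=1.0):
    """Pick the (policy, task) pair with the highest information gain per unit cost.

    Assumption: staying on the current task costs `trial_cost`; moving to any
    other task costs `trial_cost + switch_cost`.
    """
    eig = bald_eig(prob_samples)
    n_tasks = eig.shape[1]
    cost = np.full(n_tasks, trial_cost + switch_cost)
    if current_task is not None:
        cost[current_task] = trial_cost
    score = eig / cost  # cost broadcasts across the policy axis
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j), float(score[i, j])
```

In this sketch, a pair whose sampled probabilities disagree strongly gets a high score, but a large switching cost can make a less uncertain pair on the current task the better choice.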

Paper Structure

This paper contains 19 sections, 8 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Overview. Exhaustively evaluating multiple robot policies across various tasks has high experimenter cost. In this work, we leverage latent relationships between tasks and policies to model performance distributions across all tasks and policies. These estimates are updated sequentially and used to implement cost-aware active experiment selection strategies.
  • Figure 2: Method. We build a surrogate parameter estimation model that learns task and policy embeddings to predict the outcome performance distribution of a task and policy combination. We use Bernoulli distributions for binary outcomes or a bimodal Gaussian for continuous outcomes. Given this parameter estimation model, we develop an active testing strategy with cost-aware sampling based on expected information gain.
  • Figure 3: Offline Datasets used for Experiments. We consider 4 settings: (1) evaluations from HAMSTER [li2025hamster], (2) evaluations from the OpenVLA paper [kim24openvla], (3) MetaWorld [yu2020meta] where we evaluate different policies, and (4) MetaWorld where we evaluate multiple checkpoints of a single policy. For the MetaWorld evaluations, we can model the performance distributions of success rate or continuous rewards. For OpenVLA, the outcomes are binary successes. For HAMSTER, evaluations were run only once over a large number of tasks while tracking only task progress, so we use the recorded task-progress value as the mean of a unimodal Gaussian with a fixed standard deviation.
  • Figure 4: Task and Policy Representation Experiments. We compute the average log likelihood of all outcomes under the probability distribution represented by the predicted population parameters across various policy and task representations. We evaluate these methods over the HAMSTER, OpenVLA, and MetaWorld Checkpoints offline evaluation datasets over continuous and binary performance distributions. We find no large difference between random and optimal embeddings as a policy representation, indicating that there is not much shared information between policies. However, for task representation, Optimal consistently performs the best, followed by Verb, then Lang, and lastly Random. Language-based embeddings are a good task representation that we can leverage for better active learning.
  • Figure 5: Average Log Likelihood Over Cost. We show the average log likelihood of all the outcomes in our offline dataset against the cost of evaluation for MetaWorld Policies, MetaWorld Checkpoints, HAMSTER, and OpenVLA over continuous and binary performance distributions. Each set of experiments is run for 1500 trials. We find that EIG-based approaches struggle to model the true distribution in a more cost-efficient manner than Random Task sampling. Task-based sampling strategies are more cost-efficient than policy-task approaches.
  • ...and 2 more figures
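
The average log likelihood reported in the Figure 4 and Figure 5 captions can be sketched for the binary-outcome case as follows. This is a minimal illustration, assuming the offline dataset is a mapping from (policy, task) index pairs to lists of 0/1 outcomes and that the surrogate model supplies one predicted success probability per pair. The function name, data layout, and clipping constant `eps` are hypothetical and not taken from the paper.

```python
import numpy as np


def avg_bernoulli_log_likelihood(outcomes, pred_probs, eps=1e-6):
    """Average log likelihood of recorded binary outcomes.

    outcomes: dict mapping (policy_idx, task_idx) -> list of 0/1 results
              (hypothetical layout for an offline evaluation dataset).
    pred_probs: array of shape (P, T) with predicted success probabilities.
    """
    total, count = 0.0, 0
    for (i, j), ys in outcomes.items():
        # Clip so that a confident wrong prediction does not give -inf.
        p = float(np.clip(pred_probs[i, j], eps, 1 - eps))
        ys = np.asarray(ys, dtype=float)
        total += float(np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p)))
        count += ys.size
    if count == 0:
        raise ValueError("no outcomes to score")
    return total / count
```

A higher value means the predicted parameters explain the held-out outcomes better, so plotting it against accumulated evaluation cost gives a curve of the kind described for Figure 5.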