RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, Jonathan Tremblay, Kanav Arora, Kirsty Ellis, Luca Macesanu, Marcel Torne Villasevil, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Jie Wang, William Reger, Xingfang Yuan, Xuning Yang, Abhishek Gupta, Dinesh Jayaraman, Glen Berseth, Kostas Daniilidis, Roberto Martin-Martin, Youngwoon Lee, Percy Liang, Chelsea Finn, Sergey Levine

TL;DR

RoboArena tackles the challenge of evaluating generalist robot policies across broad real-world tasks by introducing a crowd-sourced, pairwise, double-blind evaluation framework that aggregates preferences into a global ranking. It extends the Bradley-Terry model with task-difficulty buckets and policy-task offsets and fits it via an EM algorithm, enabling robust rankings from asynchronous evaluations. In a DROID-based instantiation across seven universities, RoboArena achieves rankings more aligned with an exhaustive oracle than centralized benchmarks and demonstrates comparable sample efficiency. The work also adds LLM-/VLM-assisted qualitative analysis tools and opens the framework to the community, aiming to standardize and scale comparisons of generalist robot policies.

Abstract

Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
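
To make the aggregation step concrete, the following is a minimal sketch of how pairwise preferences can be turned into a ranking with the plain Bradley-Terry model, fitted by the standard minorization-maximization update. It is not the extended model used in RoboArena: the task-difficulty buckets, policy-task offsets, and EM fitting are omitted. The function name `fit_bradley_terry`, the policy names, and the comparison data are hypothetical and chosen only for illustration. The sketch assumes every policy wins at least one comparison and ignores ties.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iters=200, tol=1e-9):
    """Fit plain Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative only: no task-difficulty terms, no tie handling.
    Under the fitted model, P(i beats j) = s_i / (s_i + s_j).
    """
    policies = sorted({p for pair in comparisons for p in pair})
    wins = defaultdict(float)         # total wins per policy
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    strength = {p: 1.0 for p in policies}
    for _ in range(n_iters):
        updated = {}
        for p in policies:
            # MM update: wins_p / sum_q n_pq / (s_p + s_q)
            denom = 0.0
            for q in policies:
                if q == p:
                    continue
                n_pq = pair_counts.get(frozenset((p, q)), 0.0)
                if n_pq:
                    denom += n_pq / (strength[p] + strength[q])
            updated[p] = wins[p] / denom if denom > 0 else strength[p]
        # Strengths are only defined up to scale, so renormalize each round.
        total = sum(updated.values())
        updated = {p: v * len(policies) / total for p, v in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in policies)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical A/B outcomes, each entry is (preferred policy, other policy).
comparisons = (
    [("policy_a", "policy_b")] * 3 + [("policy_b", "policy_a")]
    + [("policy_b", "policy_c")] * 3 + [("policy_c", "policy_b")]
    + [("policy_a", "policy_c")] * 2
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Because each comparison only contributes a relative preference, evaluators do not need to share tasks or environments for the fit to produce a global ordering, as long as the comparison graph connects all policies.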

Paper Structure

This paper contains 44 sections, 24 equations, 16 figures, 3 tables, 2 algorithms.

Figures (16)

  • Figure 1: We present RoboArena, a distributed real-world evaluation framework for generalist robot policies. Instead of standardizing environments and tasks, RoboArena aggregates crowd-sourced pairwise A/B policy evaluations across a broad spectrum of environments and tasks to derive a global policy ranking. Its decentralized design makes RoboArena a scalable, comprehensive, and trustworthy framework for generalist robot policy evaluation. We open-source an instantiation of RoboArena on the DROID robot platform (Khazatsky et al., 2024) and invite community members to participate, both by contributing policies and running evaluations.
  • Figure 2: Pipeline for extracting qualitative policy characteristics from RoboArena's rich evaluation data. We use a VLM to categorize scenes and tasks, and then use an LLM to aggregate information across a large number of evaluation rollouts into a policy report that summarizes qualitative strengths and weaknesses and cites concrete evaluation rollout videos as evidence.
  • Figure 3: The DROID robot setup, which we use for the DROID-RoboArena evaluation system. Reproduced with permission from Khazatsky et al. (2024).
  • Figure 4: The DROID-RoboArena system consists of a pool of remotely hosted policy servers, a pool of distributed evaluator "clients" with real robot setups, a database for storing evaluation results, and a central evaluation management server that orchestrates communication, aggregates the evaluation results, and computes a policy ranking.
  • Figure 5: Left: Examples of RoboArena evaluations. Evaluations span a diverse set of scenes and tasks. Right: "Oracle" policy ranking, aggregated from progress scores of 4284 evaluation rollouts.
  • ...and 11 more figures