THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

Wilbert Pumacay; Ishika Singh; Jiafei Duan; Ranjay Krishna; Jesse Thomason; Dieter Fox

TL;DR

The Colosseum introduces a large-scale, perturbation-focused benchmark to quantify generalization in robotic manipulation across 20 RLBench tasks and 14 environmental perturbation factors. By evaluating 5 state-of-the-art behavior cloning (BC) approaches, including 2D and 3D representations as well as a zero-shot world-model method, the work reveals substantial performance drops under perturbations and highlights the relative robustness of 3D-based and world-model-driven approaches. The study also demonstrates a meaningful correlation between simulated and real-world perturbations, supporting the benchmark's ecological validity, and provides open-source resources for reproducibility. Overall, The Colosseum offers a scalable platform to diagnose robustness bottlenecks and guide future improvements in generalizable manipulation policies.

Abstract

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 14 axes of environmental perturbations. These perturbations include changes in the color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, physical properties, and camera pose. Using THE COLOSSEUM, we compare 5 state-of-the-art manipulation models and find that their success rate degrades by 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the success rate degrades by $\geq$75%. We identify changes in the number of distractor objects, target object color, and lighting conditions as the perturbations that reduce model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated ($\bar{R}^2 = 0.614$) with similar perturbations in real-world experiments. We open-source the code for others to use THE COLOSSEUM, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation. See https://robot-colosseum.github.io/ for more details.

Paper Structure (61 sections, 32 figures, 8 tables)

Introduction
Related Work
Robotic Manipulation Benchmarks
Robotic Manipulation Methods
Generalization in Robotic Manipulation
The Colosseum
Methodology for Task Selection
Perturbation Factors
Manipulation object (MO) perturbation
Receiver object (RO) perturbation
Background perturbation
Physical perturbation
Implementation of Perturbation Factors
Real-World Tasks and Perturbations
The Colosseum Challenge
...and 46 more sections

Figures (32)

Figure 1: Evaluating generalization with The Colosseum. Task-averaged success rate for 5 SotA robotic manipulation policies over 14 perturbation factors and 20 robotic manipulation tasks. Changes in the RGB input space affect all models due to end-to-end RGB-based training. Image-based models are also affected by camera pose changes, while models without in-the-wild pretraining suffer in the presence of distractors.
Figure 2: The Colosseum Challenge. This challenge is designed to enhance the generalization of Behavior Cloning (BC) models in robotic manipulation tasks. It involves four key phases: 1) Participants generate a standard training dataset from 20 tasks with 100 demonstrations each, without perturbation factors. 2) Participants train their BC models using this standardized dataset. 3) The models are evaluated over a fixed 25 episodes across 14 different perturbation factors. 4) Models are ranked on a leaderboard based on the percentage change in their performance across these factors. We've shown that simulation aligns with real-world evaluation, so participants can expect similar generalization when participating in the simulation benchmark.
Figure 3: The Colosseum benchmark distribution. This benchmark encompasses 14 perturbation factors within 20 distinct RLBench tasks, categorized into three tiers (simple, intermediate, and complex) according to the number of way-points involved (task horizon). Collectively, The Colosseum presents 20,371 unique task perturbation instances.
Figure 4: Real-world training tasks and their evaluation-time perturbations. A PerAct agent, trained using real-world demonstrations for the four tasks shown, was tested on real-world perturbation factors. This evaluation involved perturbing factors similar to those in the procedural simulation benchmark.
Figure 5: Task-averaged success rate % change for 4 baseline models on perturbation factors, compared to the No Perturbation test set. We report the evaluation with All Perturbations enabled, followed by each individual factor, the average of all individual factors, and RLBench variations (which are sampled from the same distribution as the training set). The images on top show failure examples for each factor, with captions explaining the failure. $\bullet$ indicates an undefined value when the corresponding No Perturbation task averages are also 0. $\circ$ indicates 0% change with respect to the No Perturbation task average.
...and 27 more figures
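
The captions of Figures 2 and 5 describe ranking models by the percentage change in success rate relative to the No Perturbation setting, with the change left undefined when the baseline is 0. The sketch below illustrates one way that metric could be computed. It is not the benchmark's own code: the function names, the dictionary layout, and the choice to average only over factors with a defined change are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def percent_change(baseline: float, perturbed: float) -> Optional[float]:
    """Percentage change of a perturbed success rate relative to the baseline.

    Returns None when the baseline is 0, mirroring the "undefined" marker
    described in the Figure 5 caption.
    """
    if baseline == 0:
        return None
    return 100.0 * (perturbed - baseline) / baseline


def mean_change(baseline: float, per_factor: Dict[str, float]) -> Optional[float]:
    """Average percentage change over factors (hypothetical aggregation)."""
    changes = [percent_change(baseline, rate) for rate in per_factor.values()]
    defined = [c for c in changes if c is not None]
    if not defined:
        return None
    return sum(defined) / len(defined)


def rank_models(
    results: Dict[str, Tuple[float, Dict[str, float]]]
) -> List[Tuple[str, float]]:
    """Sort models from least to most degraded.

    `results` maps a model name to (baseline success rate, per-factor rates).
    Models whose change is undefined are left out of the ranking.
    """
    scored = []
    for name, (baseline, per_factor) in results.items():
        score = mean_change(baseline, per_factor)
        if score is not None:
            scored.append((name, score))
    # A higher (less negative) change means the model degraded less.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up success rates, used only to exercise the functions.
    toy_results = {
        "model_a": (0.5, {"lighting": 0.25, "distractor": 0.375}),
        "model_b": (0.5, {"lighting": 0.125, "distractor": 0.25}),
    }
    for name, score in rank_models(toy_results):
        print(f"{name}: {score:.1f}% change")
```

Reporting relative rather than absolute change keeps the comparison focused on robustness instead of raw task performance, at the cost of being undefined for tasks a model never solves in the unperturbed setting.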
