Benchmarking Affordance Generalization with BusyBox

Dean Fortier, Timothy Adamson, Tess Hellebrekers, Teresa LaScala, Kofi Ennin, Michael Murray, Andrey Kolobov, Galen Mullins

TL;DR

BusyBox introduces a modular, 3D-printed physical benchmark to rigorously evaluate affordance generalization in Vision-Language-Action models. It provides six interchangeable modules that encode basic affordances and can be reconfigured to create many visual variants while preserving the same tasks, enabling cross-configuration generalization tests. The authors release CAD files and a language-annotated dataset of 1993 demonstrations, plus an experiment protocol and baseline results for strong open-weight VLAs such as $\pi_{0.5}$ and GR00T-N1.6, showing substantial performance drops when modules are rearranged or visually reconfigured. This work highlights a critical gap in current VLA generalization capabilities and offers a reproducible platform to accelerate improvements in affordance-aware manipulation.

Abstract

Vision-Language-Action (VLA) models have been attracting the attention of researchers and practitioners thanks to their promise of generalization. Although single-task policies still offer competitive performance, VLAs are increasingly able to handle commands and environments unseen in their training set. While generalization in vision and language space is undoubtedly important for robust, versatile behaviors, a key meta-skill VLAs need to possess is affordance generalization -- the ability to manipulate new objects with familiar physical features. In this work, we present BusyBox, a physical benchmark for systematic semi-automatic evaluation of VLAs' affordance generalization. BusyBox consists of 6 modules with switches, sliders, wires, buttons, a display, and a dial. The modules can be swapped and rotated to create a multitude of BusyBox variations with different visual appearances but the same set of affordances. We empirically demonstrate that generalization across BusyBox variants is highly challenging even for strong open-weights VLAs such as $\pi_{0.5}$ and GR00T-N1.6. To encourage the research community to evaluate their own VLAs on BusyBox and to propose new affordance generalization experiments, we have designed BusyBox to be easy to build in most robotics labs. We release the full set of CAD files for 3D-printing its parts as well as a bill of materials for (optionally) assembling its electronics. We also publish a dataset of language-annotated demonstrations that we collected using the common bimanual Mobile Aloha robot on the canonical BusyBox configuration. All of the released materials are available at https://microsoft.github.io/BusyBox.

Paper Structure (17 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: BusyBox configurations used in our experiments. All of them consist of 6 modules: buttons, display, knob, sliders, switches, and wires. These modules can be swapped with each other and rotated. (a) is the canonical configuration, on which we collected a dataset of demonstrations (see Figure 2). (b) is a semi-shuffled configuration: the positions of the buttons, wires, and display modules differ from the canonical configuration in (a), and the buttons module is also rotated upside down. (c) is fully shuffled -- all 5 manipulable modules differ in position or orientation from the canonical configuration: buttons, knob, sliders, and switches are moved, and the wires module is flipped. Many other shuffled configurations are possible as well.
  • Figure 2: Breakdown of the dataset of 1993 demonstrations at https://microsoft.github.io/BusyBox by affordance category. * denotes bimanual affordances, although virtually all tasks benefit from positioning both robot arms so that their wrist cameras observe the manipulated object up close.
  • Figure 3: Disassembled BusyBox.
  • Figure 4: Illustration of our data collection setup based on Mobile Aloha.
  • Figure 5: Results of the affordance generalization experiment: despite all configurations being affordance-wise in-distribution w.r.t. the training dataset (Figure 2), $\pi_{0.5}$-canon and GR00T-N1.6-canon performed well only on the visually in-distribution canonical configuration, struggling with the visually out-of-distribution semi-shuffled and fully shuffled variants.
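
The Figure 1 caption describes configurations as the same six modules placed in different positions and orientations. The Python sketch below is one way to make that concrete. It is an illustration only, not the authors' tooling. It assumes six interchangeable slots and two orientations per module (upright or flipped); the slot indices, the orientation names, and the example variant are hypothetical and do not reproduce the actual layouts in the figure.

```python
from math import factorial

# The six module names come from the Figure 1 caption.
MODULES = ("buttons", "display", "knob", "sliders", "switches", "wires")

# Assumption: each module can be mounted in one of two orientations.
ORIENTATIONS = ("upright", "flipped")


def count_configurations(n_modules=len(MODULES), n_orientations=len(ORIENTATIONS)):
    """Count layouts: every ordering of modules over slots, times every
    combination of per-module orientations."""
    return factorial(n_modules) * n_orientations ** n_modules


def changed_modules(canonical, variant):
    """Return the modules whose (slot, orientation) differs from canonical."""
    return sorted(m for m in canonical if canonical[m] != variant[m])


# Hypothetical canonical layout: module i sits in slot i, upright.
canonical = {m: (slot, "upright") for slot, m in enumerate(MODULES)}

# Hypothetical variant: swap buttons and wires, and flip the buttons module.
variant = dict(canonical)
variant["buttons"] = (canonical["wires"][0], "flipped")
variant["wires"] = (canonical["buttons"][0], "upright")

print(count_configurations())               # 720 * 64 = 46080
print(changed_modules(canonical, variant))  # ['buttons', 'wires']
```

Under these assumptions there are 6! x 2^6 = 46080 distinct layouts, which is one way to read "a multitude of BusyBox variations". The real count depends on how many slots and orientations the physical design permits.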