BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

Yue Yang; Linfeng Zhao; Mingyu Ding; Gedas Bertasius; Daniel Szafir

TL;DR

This work identifies Observation Space Shift (OSS) as a fundamental challenge in long-horizon visuomotor skill chaining: visual changes caused by preceding skills disrupt the policies of subsequent skills. It introduces BOSS, a LIBERO-based benchmark with three progressive OSS challenges (Single Predicate Shift, Accumulated Predicate Shift, and Skill Chaining), and evaluates four imitation-learning baselines, revealing substantial OSS-induced degradation even for simple shifts. The study shows that data augmentation with RAMG-derived demonstrations provides only limited mitigation, underscoring the need for algorithmic solutions that explicitly handle observation-space variations. Overall, BOSS establishes a rigorous framework for studying OSS, quantifying its impact on both per-skill and chain-level performance, and motivating the development of robust, visually aware policies for long-horizon robotic tasks.

Abstract

Robotics has long sought to develop visual-servoing robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway to this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To validate OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift", "Accumulated Predicate Shift", and "Skill Chaining", each designed to assess a different aspect of OSS's negative effect. We evaluate four recent, popular IL algorithms on BOSS: three behavioral cloning methods and the vision-language-action model OpenVLA. Even on the simplest challenge, we observe average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate a potential solution to OSS that scales up the training data for each skill with a larger and more visually diverse set of demonstrations; our results show that this alone is not sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/
