BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel Szafir

TL;DR

This work identifies Observation Space Shift (OSS) as a fundamental challenge in long-horizon visuomotor skill chaining, where visual changes caused by preceding skills disrupt subsequent skill policies. It introduces BOSS, a LIBERO-based benchmark with three progressive OSS challenges (Single Predicate Shift, Accumulated Predicate Shift, and Skill Chaining) and evaluates four imitation-learning baselines, revealing substantial OSS-induced degradation even for simple shifts. The study shows that data augmentation with RAMG-derived demonstrations provides only limited mitigation, underscoring the need for algorithmic solutions that explicitly handle observation-space variations. Overall, BOSS establishes a rigorous framework to study OSS, quantify its impact on both per-skill and chain-level performance, and motivate the development of robust, visually aware policies for long-horizon robotic tasks.

Abstract

Robotics has long sought to develop visual-servoing robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To validate OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift", "Accumulated Predicate Shift", and "Skill Chaining", each designed to assess a different aspect of OSS's negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate a potential solution to OSS that scales up the training data for each skill with a larger and more visually diverse set of demonstrations, with our results showing it is not sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/

Paper Structure

This paper contains 30 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The example illustrates how Observation Space Shift (OSS) occurs when chaining two pre-trained skills: PlaceObject(potato, bowl) and MoveContainer(bowl, cabinet). During deployment, OSS arises in MoveContainer(bowl, cabinet) because the observation space changes, with a potato inside the bowl instead of the empty bowl scenario from MoveContainer(bowl, cabinet)'s pre-training.
  • Figure 2: This figure illustrates the three challenges of BOSS, each examining a distinct aspect of OSS, using concrete examples: OpenDrawer(cabinet, bottom) (green), PlaceObject(potato, bowl) (red), and MoveContainer(bowl, cabinet) (blue). Challenge 1, Single Predicate Shift (BOSS-C1), shows a case where OSS occurs due to the modification of a single predicate (i.e., the circle in the figure) caused by the effect of the previous skill (e.g., IN(potato, bowl)). Challenge 2, Accumulated Predicate Shift (BOSS-C2), highlights the scenario where OSS arises from multiple predicate changes (i.e., circles in the figure) due to accumulated effects from preceding skills (e.g., IN(potato, bowl) and DrawerOpen(cabinet, top)). Challenge 3, Real Long-Horizon Task (BOSS-C3), showcases how OSS impacts a real long-horizon task with three skills, where "Single Predicate Shift" and "Accumulated Predicate Shift" occur in the second skill and the third skill respectively, significantly degrading the final task performance.
  • Figure 3: This figure presents the results for BOSS-C1. In each baseline subfigure, the majority of points lie below the diagonal line, representing tasks with a positive Ratio Performance Delta, indicating that single predicate modification negatively affects task performance.
  • Figure 4: This figure shows the results for BOSS-C2. Two bar charts summarize: (top) the average positive Ratio Performance Delta for sets with different numbers of modifications, and (bottom) the average ratio of OSS occurrence across sets with varying numbers of modifications. The upward trend of the red lines in both bar charts indicates that the accumulation of OSS progressively exacerbates its negative impact on long-horizon task completion, both in magnitude and frequency.
  • Figure 5: This figure presents the results for BOSS-C3, where the "Delta to Upper Bound Ratio" (bars) is positive and notably high in most cases, highlighting the substantial negative impact of OSS on long-horizon task completion.
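
The captions above describe OSS in terms of predicates that differ between a skill's training scene and its deployment scene, and report a "Ratio Performance Delta" without defining it here. The sketch below is illustrative only and is not the benchmark's code. It assumes a symbolic state can be written as a set of predicate strings, and it assumes the Ratio Performance Delta is the relative drop in success rate, (without OSS - with OSS) / without OSS, which is consistent with "positive" meaning degraded performance. The function names and the `ON(bowl, table)` predicate are hypothetical.

```python
from typing import Set

def predicate_shift(train_state: Set[str], deploy_state: Set[str]) -> Set[str]:
    """Predicates that differ between the scene a skill was trained in
    and the scene it observes at deployment (symmetric difference)."""
    return set(train_state) ^ set(deploy_state)

def ratio_performance_delta(sr_without_oss: float, sr_with_oss: float) -> float:
    """ASSUMED definition: relative drop in success rate caused by OSS.
    Positive values mean the skill performs worse under OSS."""
    if sr_without_oss <= 0:
        raise ValueError("baseline success rate must be positive")
    return (sr_without_oss - sr_with_oss) / sr_without_oss

# Training scene for MoveContainer(bowl, cabinet): an empty bowl.
# "ON(bowl, table)" is a hypothetical placeholder predicate.
train = {"ON(bowl, table)"}

# Single predicate shift: PlaceObject(potato, bowl) ran first.
deploy_c1 = train | {"IN(potato, bowl)"}

# Accumulated shift: a drawer was also opened by an earlier skill.
deploy_c2 = deploy_c1 | {"DrawerOpen(cabinet, top)"}

print(sorted(predicate_shift(train, deploy_c1)))
print(len(predicate_shift(train, deploy_c2)))
print(ratio_performance_delta(0.8, 0.4))
```

Under these assumptions, the first scene differs from training by one predicate and the second by two, mirroring the distinction between the single and accumulated challenges.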