Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song

TL;DR

This work identifies shortcut learning as a key obstacle to generalization in generalist robot policies and links it to dataset structure, specifically limited diversity within sub-datasets and fragmentation across them. It provides a formal framework relating task-relevant and task-irrelevant factors, supported by theoretical propositions and empirical validation on LIBERO and real-world setups. The authors show that increasing diversity within each sub-dataset and reducing disparity across sub-datasets mitigates shortcuts, and they demonstrate two practical data-augmentation strategies (viewpoint augmentation and object augmentation) that enhance diversity and bridge distribution gaps in offline data. The findings offer actionable guidance for dataset collection and augmentation to improve both simulation and real-world generalization of Vision-Language-Action policies, especially when acquiring new large-scale data is impractical.

Abstract

Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning -- the reliance on task-irrelevant features -- as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.

Paper Structure

This paper contains 36 sections, 2 theorems, 23 equations, 16 figures, 3 tables.

Key Result

Proposition 3.1

Given two sub-datasets where the supports for both variables are disjoint, i.e., $U_1\cap U_2=\varnothing$ and $V_1\cap V_2=\varnothing$, the normalized mutual information between $u$ and $v$ is given by: where $C_{\mathrm{diversity}} = H(u_1)+H(u_2)+H(v_1)+H(v_2)$ is the sum of the entropies within each sub-dataset.
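The relationship between within-sub-dataset diversity and the normalized mutual information can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not stated in the text above: the two sub-datasets are mixed with equal weight, $u$ and $v$ are uniform and independent inside each sub-dataset, and the normalization is $2I(u,v)/(H(u)+H(v))$. The function names `entropy`, `normalized_mi`, and `two_subdataset_joint` are hypothetical helpers, not part of the paper's code. Under these assumptions the mutual information equals $\log 2$, and the normalized value works out to $4\log 2 / (C_{\mathrm{diversity}} + 4\log 2)$; this is a derivation for the toy setting and not necessarily the form given in the paper.

```python
import math
from itertools import product


def entropy(probs):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_mi(joint):
    """Normalized mutual information 2*I(u;v) / (H(u) + H(v)).

    `joint` maps (u, v) pairs to probabilities. The normalization
    convention is an assumption made for this sketch.
    """
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    h_u = entropy(pu.values())
    h_v = entropy(pv.values())
    h_uv = entropy(joint.values())
    mutual_info = h_u + h_v - h_uv
    return 2.0 * mutual_info / (h_u + h_v)


def two_subdataset_joint(n_values):
    """Toy joint distribution for two equally weighted sub-datasets.

    Each sub-dataset d has its own n_values values of u and of v
    (tagged with d, so the supports are disjoint), and u and v are
    uniform and independent inside the sub-dataset.
    """
    joint = {}
    for d in range(2):
        for i, j in product(range(n_values), repeat=2):
            joint[((d, i), (d, j))] = 0.5 / (n_values * n_values)
    return joint


def closed_form(n_values):
    """4*log(2) / (C_diversity + 4*log(2)) for the toy setting.

    Here every within-sub-dataset entropy equals log(n_values),
    so C_diversity = 4 * log(n_values).
    """
    c_diversity = 4.0 * math.log(n_values)
    return 4.0 * math.log(2) / (c_diversity + 4.0 * math.log(2))


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        print(n, normalized_mi(two_subdataset_joint(n)), closed_form(n))
```

In this toy setting the normalized mutual information is 1 when each sub-dataset contains a single value of $u$ and $v$ (the sub-dataset identity fully determines both), and it falls to 1/3 when each sub-dataset contains four values of each. This matches the qualitative claim that higher diversity inside each sub-dataset weakens the spurious dependence between the two factors.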

Figures (16)

  • Figure 1: Demonstrations of shortcut learning in generalist robot policies. Left: Three generalist robot policies trained on the OXE dataset exhibit shortcut behavior in the SIMPLER environment. Despite being tasked with "put the spoon on the towel", a task present in the Bridge sub-dataset, all models consistently perform the task "pick up the coke", which is exclusive to the RT-1 sub-dataset. Right: The $\pi_0$ policy after finetuning on real-world data exhibits shortcut behavior. The policy was finetuned on two distinct data subsets: (Viewpoint A, Instruction C) and (Viewpoint B, Instruction D). When tasked with Instruction D from the novel configuration of Viewpoint A, the policy incorrectly executes Instruction C. This indicates that the policy has learned to associate the viewpoint with the action, ignoring the provided instruction.
  • Figure 2: Comparison of visual (left) and text (right) diversity (log scale) between OXE Sub-Datasets and vision/multimodal Datasets. OXE sub-datasets exhibit significantly lower diversity compared to their vision and multimodal counterparts. We simply chose $t=20$ as it does not influence the general trend.
  • Figure 3: Comparison of t-SNE visualizations for vision/multimodal datasets (left) versus OXE Magic Soup++ (right). The figure shows the clear data fragmentation in the OXE dataset, in contrast to the more intertwined data structure observed in the visual and multimodal datasets.
  • Figure 4: Comparison of the visual disparity metric $S_{\mathrm{disparity}}$ (top) and the combined metric $\frac{S_\mathrm{disparity}}{S_\mathrm{diversity}}$ (bottom) between OXE and vision/multimodal datasets at different temperatures.
  • Figure 5: An example of our LIBERO experiment setting, with only one task (or equivalently, one object position/language) within each sub-dataset.
  • ...and 11 more figures

Theorems & Definitions (4)

  • Proposition 3.1: Mutual Information in Disjoint Sets
  • Proposition 3.2: Mutual Information in Overlapping Sets
  • proof
  • proof