A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

TL;DR

The paper investigates how pre-training data and methods shape the transferability of vision models to robot learning. It shows that DINO and iBOT excel on manipulation and perception tasks when pre-trained on object-centric data, but that their performance degrades on non-(single-)object-centric (NOC) data, which points to a bottleneck in learning objectness from such data. To address this, the authors introduce SlotMIM, which induces object-centric representations from NOC data via a semantic bottleneck and cross-view consistency, enabling object discovery and slot-based contrastive learning. Across the datasets and tasks evaluated, SlotMIM shows better data efficiency than prior methods, even when trained on far fewer samples, and scales well to million-image data, offering a practical path to robust PVMs for robot learning.

Abstract

Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics datasets is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations through two components: a semantic bottleneck, which reduces the number of prototypes to encourage the emergence of objectness, and cross-view consistency regularization, which encourages multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
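
To make the idea of a prototype bottleneck more concrete, the following is a minimal sketch, not the authors' implementation. It assumes hard (argmax) assignment by cosine similarity and mean pooling, and it uses plain Python lists instead of a tensor library. The function names `assign_to_prototypes` and `pool_slots` and all numeric values are hypothetical. Each patch feature is assigned to its most similar prototype, and patches that share an assignment are averaged into one slot feature.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def assign_to_prototypes(patches, prototypes):
    """Return, for each patch, the index of its most similar prototype (cosine)."""
    protos = [l2_normalize(p) for p in prototypes]
    assignments = []
    for patch in patches:
        unit = l2_normalize(patch)
        sims = [sum(a * b for a, b in zip(unit, proto)) for proto in protos]
        assignments.append(max(range(len(sims)), key=sims.__getitem__))
    return assignments

def pool_slots(patches, assignments):
    """Average the patch features that share a prototype index into one slot."""
    groups = {}
    for patch, k in zip(patches, assignments):
        groups.setdefault(k, []).append(patch)
    return {k: [sum(col) / len(members) for col in zip(*members)]
            for k, members in groups.items()}

# Toy input: four 2-D patch features and two prototypes (hypothetical values).
patches = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.8]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
assignments = assign_to_prototypes(patches, prototypes)
slots = pool_slots(patches, assignments)
```

With fewer prototypes, more patches fall into each group, so each slot covers a coarser region. This is one way to picture why a small prototype set might favor object-level groupings over fine-grained patterns.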

Paper Structure

This paper contains 73 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An overview of this paper. (a) We conduct a comprehensive study evaluating pre-trained vision models (PVMs) on visuomotor control and perception tasks, analyzing how different pre-training (model, data) combinations affect performance. Our analysis reveals that DINO and iBOT excel while MAE underperforms. (b) We investigate the performance drop of DINO/iBOT when trained on non-(single-)object-centric (NOC) data, discovering that they struggle to learn objectness from NOC data, a capability that strongly correlates with robot manipulation performance. (c) We introduce SlotMIM, which incorporates explicit objectness guidance during training to effectively learn object-centric representations from NOC data. (d) Through scaled-up pre-training and evaluation across six tasks, we demonstrate that SlotMIM adaptively learns different types of objectness based on the pre-training dataset characteristics, outperforming existing methods.
  • Figure 2: Performance of PVMs trained with different (model, data) combinations on visuomotor control and perception tasks (241K scale, best viewed together with Figure 1a). Our analysis of existing works reveals several key findings: 1) MAE with ego-centric data shows only moderate performance on visuomotor control tasks and performs poorly on ADE20K; 2) DINO and iBOT lead performance across all tasks, with their best models typically trained on object-centric data (except for ADE20K); 3) The top-3 models (DINO, iBOT, and MAE) struggle to learn effective representations for manipulation when trained on scene-centric data. Most notably, 4) SlotMIM (see the method section) consistently outperforms prior methods regardless of whether it is pre-trained on object-centric data or not.
  • Figure 3: Behavior cloning with attentive probing. An additional token is trained with cross-attention (trainable) to gather information from all patch tokens from the backbone (frozen), and fed to the policy to learn from expert demonstrations via behavior cloning.
  • Figure 4: Comparison of concepts learned by iBOT and SlotMIM. All models are trained on COCO+ for 800 epochs. While iBOT can discover fine-grained patterns, especially when using fewer prototypes (left), these patterns emerge bottom-up and lack semantic meaning. In contrast, SlotMIM's concepts are semantically coherent, making them more effective for instance discrimination pretext tasks (right).
  • Figure 5: Overview of SlotMIM. Our framework extends iBOT by: 1) repurposing its within-view patch-level self-distillation for object discovery, 2) introducing a cross-view objective for semantic guidance, and 3) performing object-centric contrastive learning on slots (object features grouped from patches with matching cluster assignments). This approach provides explicit objectness supervision without requiring object-centric data, making it applicable to various types of NOC data (see Figure 1c for comparison and Figure 1d for results).
  • ...and 4 more figures
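
The attentive probing setup described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under simplifying assumptions rather than the paper's code: it uses a single attention head, identity key and value projections, and plain Python lists. The names `attentive_pool`, `query`, and `patch_tokens` and all values are hypothetical. The query token stands in for the trainable probe, while the patch tokens stand in for frozen backbone outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(query, patch_tokens):
    """Single-head cross-attention: one query token attends over all patch tokens.

    Keys and values are the patch tokens themselves (identity projections),
    which is a simplification for illustration.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, token)) / scale
              for token in patch_tokens]
    weights = softmax(scores)
    pooled = [sum(w * token[i] for w, token in zip(weights, patch_tokens))
              for i in range(len(query))]
    return pooled, weights

# Frozen backbone output: three patch tokens of dimension 2 (hypothetical values).
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The probe token is the only trainable piece here; a zero query gives uniform attention.
query = [0.0, 0.0]
pooled, weights = attentive_pool(query, patch_tokens)
```

In a behavior-cloning setup, the pooled vector would be passed to a policy head and the query (along with any attention projections) would be updated from expert demonstrations, while the backbone producing `patch_tokens` stays fixed.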