What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?

Kaylee Burns; Zach Witzel; Jubayer Ibn Hamid; Tianhe Yu; Chelsea Finn; Karol Hausman

TL;DR

The paper investigates why pre-trained visual representations generalize for robust robotic manipulation under visual distribution shifts. By benchmarking 15 models across two simulated environments and validating on a real-world task, it reveals that emergent segmentation ability in Vision Transformers (ViTs), quantified by a Jaccard index over attention maps, is a strong predictor of out-of-distribution generalization, outperforming traditional metrics like in-domain accuracy and ImageNet linear probes. Manipulation-focused pre-training does not consistently outperform standard pre-training, and self-supervised ViTs (e.g., DINO) can excel under distribution shifts. These findings suggest designing foundation models for robotics that emphasize segmentation capabilities in attention rather than solely increasing data scale or using task-specific pre-training, with practical implications for robust perception in real-world robotics.

Abstract

Inspired by the success of transfer learning in computer vision, roboticists have investigated visual pre-training as a means to improve the learning efficiency and generalization ability of policies learned from pixels. To that end, past work has favored large object interaction datasets, such as first-person videos of humans completing diverse tasks, in pursuit of manipulation-relevant features. Although this approach improves the efficiency of policy learning, it remains unclear how reliable these representations are in the presence of distribution shifts that arise commonly in robotic applications. Surprisingly, we find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture or the introduction of distractor objects. To understand what properties do lead to robust representations, we compare the performance of 15 pre-trained vision models under different visual appearances. We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models. The rank order induced by this metric is more predictive than metrics that have previously guided generalization research within computer vision and machine learning, such as downstream ImageNet accuracy, in-domain accuracy, or shape-bias as evaluated by cue-conflict performance. We test this finding extensively on a suite of distribution shifts in ten tasks across two simulated manipulation environments. On the ALOHA setup, segmentation score predicts real-world performance after offline training with 50 demonstrations.
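As a rough illustration of the segmentation score mentioned above, the sketch below binarizes a ViT attention map and compares it with a ground-truth object mask using the Jaccard index (intersection over union). The function names, the `keep_mass` parameter, and the rule of keeping the patches that hold 60% of the attention mass are assumptions made for this example; the abstract does not specify how the attention maps are thresholded.

```python
import numpy as np


def attention_to_mask(attention, keep_mass=0.6):
    """Binarize an attention map by keeping its highest-valued patches.

    Assumption for this sketch: keep the smallest set of top patches whose
    attention sums to at least `keep_mass` of the total.
    """
    attn = np.asarray(attention, dtype=float)
    flat = attn.ravel()
    order = np.argsort(flat)[::-1]                  # patch indices, highest attention first
    cumulative = np.cumsum(flat[order]) / flat.sum()
    n_keep = int(np.searchsorted(cumulative, keep_mass) + 1)
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(attn.shape)


def jaccard_index(pred_mask, true_mask):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # both masks empty; treated as perfect agreement by convention here
    return float(np.logical_and(pred, true).sum() / union)


# Toy 2x2 attention map and ground-truth mask (illustrative values only).
attention = np.array([[0.5, 0.3],
                      [0.1, 0.1]])
ground_truth = np.array([[True, False],
                         [False, False]])

predicted = attention_to_mask(attention, keep_mass=0.6)
score = jaccard_index(predicted, ground_truth)
```

In this toy case the two highest-attention patches are kept and one of them overlaps the object, so the score is 0.5. A model whose attention concentrates on the task-relevant object would score closer to 1.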

Paper Structure (22 sections, 1 equation, 12 figures, 4 tables)

Introduction
Related Work
Environments, Evaluation Protocol, and Pre-Trained Models
Generalization of Models Pre-Trained for Manipulation
Properties of Robust Visual Representations for Manipulation
Metrics
Setup
Results
Validating in the real world
Conclusion
Appendix
Pre-Trained Model Details
Details of the Environments
Details of the Distribution Shifts
Policy Training Details
...and 7 more sections

Figures (12)

Figure 1: We find that the emergent segmentation ability of ViT attention heads (measured by Jaccard index) predicts performance under visual distribution shift. We refer to models with this property as having "segmenting-features." Notice how the attention of MVP shifts towards the sugar box distractor object in the bottom right image. The impact of this factor overshadows other design choices such as data relevance.
Figure 2: Evaluation Scheme. We begin our evaluation procedure by training a policy with behavior cloning on top of frozen features. In every experimental setting, we ablate the encoder used to extract features from the image observation. The learned policy is then evaluated in each of the visual shift environments to attain a zero-shot success value.
Figure 3: Visual Generalization Performance. Models trained with supervision on ImageNet are shades of blue. Models trained with self-supervision on ImageNet are in red. Models trained explicitly for manipulation and control tasks are orange. Dotted bars denote ResNets and slashed bars denote ViTs. Surprisingly, the best performing models are not necessarily the ones designed for manipulation. Each bar is an average over 30 experimental conditions.
Figure 4: Average success rates for training and test distribution across both environments for every model in our evaluation suite. The best-performing model that was designed for manipulation ranks seventh out of all models evaluated.
Figure 5: We plot the relationship between different metrics and out-of-distribution (OOD) generalization. There is a promising correlation between shape-bias and OOD performance for ResNets, but not ViTs. Instead, OOD performance for ViTs is strongly correlated with Jaccard index.
...and 7 more figures
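The comparison in Figure 5 asks how well the ordering of models under a metric matches their ordering by OOD success. One common way to quantify that is a rank correlation; the sketch below computes Spearman's coefficient with NumPy. The choice of Spearman's coefficient, the function names, and all numbers are assumptions for illustration: the values are made-up placeholders, not results from the paper, and this summary does not state which statistic the authors use.

```python
import numpy as np


def ranks(values):
    """Rank values from 0 (smallest) to n-1 (largest). Assumes no ties."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


# Placeholder per-model values (NOT from the paper): a hypothetical metric
# score for four models and a hypothetical OOD success rate for each.
metric_score = [0.10, 0.40, 0.30, 0.80]
ood_success = [0.15, 0.35, 0.30, 0.60]

rho_same_order = spearman(metric_score, ood_success)
rho_reversed = spearman(metric_score, ood_success[::-1])
```

A coefficient near 1 means the metric orders the models the same way OOD success does; a coefficient near -1 means it orders them in reverse. Because only ranks matter, the scale of the metric does not affect the result.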
