Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies

Jianing Qian; Anastasios Panagopoulos; Dinesh Jayaraman

TL;DR

This work tackles the mismatch between generic pre-trained vision transformers and robotics needs by introducing SOFT, a training-free wrapper that uses transformer attentions to identify and describe object-like regions within images. SOFT infers an object-centric embedding by computing inputwise attentions through attention rollout, performing background removal, and applying spectral clustering to produce object slots. These slots are described via activation-based features and then aligned to a fixed-size policy input using Hungarian matching, enabling behavior cloning for manipulation tasks. Across synthetic and real robotics domains, SOFT consistently outperforms vanilla PVT features and approaches robotics-specific representations, demonstrating that attention signals can bridge the gap between general vision models and robot learning. The results suggest a practical path to leveraging large, generic vision models for efficient, robust robot manipulation without additional training.
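
The inputwise attention step can be illustrated with the commonly used attention rollout recipe. The sketch below is a minimal NumPy version under stated assumptions, not the authors' implementation: it assumes each layer's attention has already been averaged over heads, and it models the residual connection with an equal 0.5/0.5 mix of attention and identity. The function name `attention_rollout` and the random inputs are hypothetical.

```python
import numpy as np


def attention_rollout(layer_attentions):
    """Compose per-layer attentions into input-level attentions.

    layer_attentions: list of (N, N) row-stochastic matrices, ordered from
    the first layer to the last (assumed to be head-averaged already).
    Returns an (N, N) matrix whose entry [i, j] estimates how much output
    token i draws on input token j.
    """
    n = layer_attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in layer_attentions:
        # Assumption: model the residual connection by mixing in the identity.
        mixed = 0.5 * attn + 0.5 * np.eye(n)
        # Re-normalize so each row remains a distribution over tokens.
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        # Chain this layer onto the attention accumulated so far.
        rollout = mixed @ rollout
    return rollout


# Hypothetical example: 4 layers, 6 tokens, random softmax attentions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6, 6))
attns = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
inputwise = attention_rollout(list(attns))
```

Each row of `inputwise` stays a probability distribution over input patches, so rows can be compared directly when grouping patches into object-like regions.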

Abstract

Generic, reusable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers, abbreviated as SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than constructing representations out of only the final-layer activations, SOFT individuates and locates object-like entities from PVT attentions and describes them with PVT activations, producing an object-centric embedding. Across standard choices of the generic pre-trained vision transformer PVT, we demonstrate in each case that policies trained on SOFT(PVT) far outstrip standard PVT representations for manipulation tasks in simulated and real settings, approaching the state-of-the-art robotics-aware representations. Code, appendix and videos: https://sites.google.com/view/robot-soft/
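
To make the "describe with activations, then produce a fixed-size embedding" idea concrete, here is a minimal sketch. It rests on several assumptions that are not taken from the paper: slot descriptors are formed by mean-pooling patch features inside each slot mask, slots are matched to a reference slot set, and the matching cost is Euclidean distance. The helper names `pool_slot_descriptors` and `align_slots` are hypothetical; `scipy.optimize.linear_sum_assignment` solves the same assignment problem that Hungarian matching addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pool_slot_descriptors(patch_features, slot_labels, num_slots):
    """Mean-pool patch features inside each slot mask.

    patch_features: (N, D) array of per-patch activations.
    slot_labels: (N,) ints in [0, num_slots), or -1 for background patches.
    Empty slots are left as zero vectors.
    """
    slots = np.zeros((num_slots, patch_features.shape[1]))
    for k in range(num_slots):
        mask = slot_labels == k
        if mask.any():
            slots[k] = patch_features[mask].mean(axis=0)
    return slots


def align_slots(slots, reference_slots):
    """Reorder slots so that slot i best matches reference slot i."""
    # Pairwise Euclidean cost: rows index reference slots, columns index slots.
    cost = np.linalg.norm(
        reference_slots[:, None, :] - slots[None, :, :], axis=-1
    )
    ref_idx, slot_idx = linear_sum_assignment(cost)
    aligned = np.zeros_like(reference_slots)
    aligned[ref_idx] = slots[slot_idx]
    return aligned


# Hypothetical example: 8 patches, 4-dim features, 3 slots plus background.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))
labels = np.array([0, 0, 1, 1, 2, 2, -1, -1])
reference = pool_slot_descriptors(features, labels, num_slots=3)

# Same regions, but the clustering happened to number the slots differently.
permuted_labels = np.array([2, 2, 0, 0, 1, 1, -1, -1])
shuffled = pool_slot_descriptors(features, permuted_labels, num_slots=3)

# Flatten the aligned slots into a fixed-size vector for a policy network.
policy_input = align_slots(shuffled, reference).reshape(-1)
```

Matching removes the arbitrary slot ordering that clustering produces, so the same object tends to land in the same position of the policy input across frames.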

Paper Structure (18 sections, 2 equations, 7 figures, 4 tables)

Introduction
Related Work
Robotics-Specific Pre-Trained Image Representations
Object-Centric Embeddings (OCE)
Object Discovery from Image Encoders
Manipulation Policies Using Representations Inferred from Generic Vision Transformers
The Information Held Within Transformer Self-Attentions
A Procedure To Infer Objects and Their Locations
Activation-Based Object Descriptions
Policy Learning from Demonstrations
Experiments
Discovering Object-Like Regions
Robot Manipulation Experiments
Real Robot Experiments
Conclusions
...and 3 more sections

Figures (7)

Figure 1: Visualizing attentions for two images from ImageNet and Shapestacks. On each image, we consider a foreground patch (yellow) and a background patch (red). Attentions $a_{ii}$ for each patch towards itself are zeroed out to identify the patch in attention images. Layerwise attentions at various DINO layers (left) reveal little information at higher layers, so we instead compute inputwise attentions (right) using attention flow.
Figure 2: SOFT$(\cdot)$ provides a wrapper around any pre-trained vision transformer model PVT. Relying on nothing other than the activations and attentions throughout PVT, SOFT$(\cdot)$ offers an alternative representation inference procedure to the standard last-layer activations. The resulting SOFT$(\texttt{PVT})$ representation is an object-centric image representation, suitable for off-the-shelf usage in robotic control tasks. $V^l$, $K^l$ and $Q^l$ represent the value, key and query of the attention layer at layer $l$.
Figure 3: Data from our datasets and simulation environments. We show the segmentation masks from SOFT$(\texttt{PVT})$ as well as the original images. Blue indicates background.
Figure 4: Success rate of different methods as a function of the number of demonstrations.
Figure 5: Example of a successful policy rollout using SOFT(DINOv2).
...and 2 more figures
