Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies
Jianing Qian, Anastasios Panagopoulos, Dinesh Jayaraman
TL;DR
This work tackles the mismatch between generic pre-trained vision transformers (PVTs) and the needs of robotics by introducing SOFT, a training-free wrapper that uses transformer attentions to identify object-like regions within images and PVT activations to describe them. SOFT infers an object-centric embedding by computing attentions over the input patches through attention rollout, performing background removal, and applying spectral clustering to produce object slots. These slots are described via activation-based features and then aligned to a fixed-size policy input using Hungarian matching, enabling behavior cloning for manipulation tasks. Across simulated and real robotics domains, SOFT consistently outperforms vanilla PVT features and approaches the performance of robotics-specific representations, demonstrating that attention signals can bridge the gap between general vision models and robot learning. The results suggest a practical path to leveraging large, generic vision models for efficient, robust robot manipulation without additional training.
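To make the attention rollout step concrete, the following is a minimal sketch of the standard rollout recursion, not the authors' implementation. It assumes the per-layer attention maps are already available as row-stochastic arrays; the head averaging, the 0.5/0.5 residual weighting, and the function name `attention_rollout` are illustrative assumptions.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-input attention map.

    attentions: list of arrays with shape (heads, tokens, tokens), one per
    transformer layer, ordered from first to last layer. Each row is assumed
    to sum to 1 (softmax output).
    Returns an array of shape (tokens, tokens) whose rows also sum to 1.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Assumption: heads are fused by a simple mean.
        fused = layer_attention.mean(axis=0)
        # Account for the residual connection by mixing in the identity.
        fused = 0.5 * fused + 0.5 * identity
        # Re-normalize so each row remains a distribution over tokens.
        fused = fused / fused.sum(axis=-1, keepdims=True)
        # Propagate attention from this layer down to the input tokens.
        rollout = fused @ rollout
    return rollout


# Toy usage with random attentions: 3 layers, 2 heads, 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 5, 5))
toy_attentions = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
toy_rollout = attention_rollout(list(toy_attentions))
```

Each row of the result describes how strongly one output token depends on each input token. How SOFT thresholds this map for background removal and clusters the remaining patches is not specified here, so those steps are left out of the sketch.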
Abstract
Generic, reusable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers, abbreviated as SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than constructing representations out of only the final-layer activations, SOFT individuates and locates object-like entities from PVT attentions and describes them with PVT activations, producing an object-centric embedding. Across standard choices of generic PVTs, we demonstrate in each case that policies trained on SOFT(PVT) far outstrip standard PVT representations for manipulation tasks in simulated and real settings, approaching the state-of-the-art robotics-aware representations. Code, appendix and videos: https://sites.google.com/view/robot-soft/
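As an illustration of describing object-like entities with PVT activations, the sketch below pools patch activations within each object region to form one vector per slot. It is a hypothetical example under stated assumptions: the patch-to-slot labels are taken as given, mean pooling is an assumed choice of descriptor, and the name `describe_slots` does not come from the paper.

```python
import numpy as np


def describe_slots(patch_features, slot_labels, num_slots):
    """Build one descriptor per object slot by pooling patch activations.

    patch_features: array of shape (num_patches, dim), e.g. PVT activations.
    slot_labels: integer array of shape (num_patches,); -1 marks background
    patches, values in [0, num_slots) assign a patch to a slot.
    Returns an array of shape (num_slots, dim); empty slots stay zero so the
    output size is fixed regardless of how many objects were found.
    """
    dim = patch_features.shape[1]
    slots = np.zeros((num_slots, dim))
    for slot_index in range(num_slots):
        mask = slot_labels == slot_index
        if mask.any():
            # Assumption: mean pooling over the patches in the region.
            slots[slot_index] = patch_features[mask].mean(axis=0)
    return slots


# Toy usage: 4 patches with 2-dim features, two objects plus background.
features = np.arange(8, dtype=float).reshape(4, 2)
labels = np.array([0, 0, 1, -1])
slot_descriptors = describe_slots(features, labels, num_slots=3)
```

Padding unused slots with zeros is one simple way to keep the embedding a fixed size for a downstream policy; the paper's own alignment procedure may differ.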
