Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation

Mohit Sharma; Claudio Fantacci; Yuxiang Zhou; Skanda Koppula; Nicolas Heess; Jon Scholz; Yusuf Aytar

TL;DR

Pretrained vision models offer transferable representations for robotics, but full fine-tuning can erode their original capabilities. The authors introduce lossless adaptation: parameter-efficient adapters inserted at bottom, middle, and top positions of the network, which preserve the pretrained representation while approaching full fine-tuning performance. The approach is validated on ViTs, NFNets, and ResNets with both supervised and self-supervised pretraining, across Metaworld, Franka-Kitchen, and RGB Stacking, including sim2real transfer. Because the base model is never altered, this gives a scalable, storage-efficient path to reusing large vision foundation models for multi-task robotic manipulation.
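The storage argument can be made concrete with a little arithmetic. The sketch below compares storing one fully fine-tuned copy of the backbone per task against storing one shared backbone plus a small adapter set per task. The function names and all parameter counts are hypothetical, chosen only for illustration; they are not figures from the paper.

```python
def params_full_finetuning(backbone_params: int, num_tasks: int) -> int:
    """Full fine-tuning keeps a separate copy of the whole backbone per task."""
    return backbone_params * num_tasks


def params_with_adapters(backbone_params: int, adapter_params: int, num_tasks: int) -> int:
    """Adapters keep one shared backbone plus one small adapter set per task."""
    return backbone_params + adapter_params * num_tasks


# Hypothetical sizes: a 25M-parameter backbone and adapters worth 1% of it.
BACKBONE = 25_000_000
ADAPTER = 250_000
TASKS = 35

full = params_full_finetuning(BACKBONE, TASKS)
shared = params_with_adapters(BACKBONE, ADAPTER, TASKS)
print(f"full fine-tuning: {full:,} parameters")
print(f"shared backbone + adapters: {shared:,} parameters")
```

Under these assumed sizes the per-task cost is the adapter alone, so total storage grows slowly with the number of tasks.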

Abstract

Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation and causes representational drift towards the fine-tuned task, leading to a loss of the versatility of the original model. We introduce "lossless adaptation" to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changing the original representation, thus preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE) in 3 task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings.
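To illustrate the idea, here is a minimal sketch of a residual bottleneck adapter attached to a frozen layer, written in plain Python. It is not the authors' implementation: `frozen_layer` is a hypothetical stand-in for a pretrained layer, the class and function names are invented for this example, and the zero initialization of the up-projection is an assumption made so that the adapted model starts out identical to the pretrained one.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Residual bottleneck: x + W_up * relu(W_down * x). Biases omitted."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim, self.bottleneck = dim, bottleneck
        # Down-projection: small random values (assumed initialization).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: zeros, so the adapter is the identity at start (assumed).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        return [a + d for a, d in zip(x, delta)]

    def num_params(self):
        return 2 * self.dim * self.bottleneck


def frozen_layer(x):
    """Hypothetical pretrained layer; its weights are never updated."""
    return [2.0 * a + 1.0 for a in x]


def forward(x, adapter=None):
    """Without an adapter this is exactly the original pretrained computation."""
    features = frozen_layer(x)
    return adapter(features) if adapter is not None else features


x = [1.0, -2.0, 0.5, 3.0]
adapter = BottleneckAdapter(dim=4, bottleneck=2)
print(forward(x) == forward(x, adapter))  # identical at initialization
print(adapter.num_params())
```

Only the adapter weights would be trained in this sketch. Since `frozen_layer` is untouched, calling `forward(x)` without the adapter always recovers the original features, which is the sense of "lossless" used here.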

Paper Structure (22 sections, 7 figures, 7 tables)

Introduction
Related Works
Approach
Adapter Modules
Visual Adapters for Control
Experimental Setup
Manipulation Tasks
Network Architectures
Results
Fixed Pretrained Features vs Adapter Representations
Effects of Adapter Locations & Different Pretrained Representations
Sim2Real Results
Conclusion
Ethics Statement
Acknowledgments
...and 7 more sections

Figures (7)

Figure 1: Parameter-efficient lossless adaptation. Existing works adapt pretrained general-purpose visual models (a) either through full end-to-end fine-tuning as shown in (b), which loses the original capabilities of the model, or by adapting frozen pretrained models through top-adapters as shown in (c), which often fails to achieve optimal control performance. However, by introducing additional mid-level and bottom-level adaptation as in (d), we still maintain the existing perceptual capabilities while approaching the full fine-tuning performance, as empirically shown in (e) over many network architectures and pretraining methods.
Figure 2: Adapter layers used for convolution-based (Left) and transformer-based (Right) architectures. For both scenarios we use a bottleneck design.
Figure 3: Different locations to insert adapter modules for convolution (Left) and transformer (Right) models.
Figure 4: Different environments we evaluate our approach on. For the Metaworld and Kitchen suites we follow the setup from nair2022r3m, including the same set of demonstrations. For the RGB-Stacking suite we use the Skill Mastery setting (lee2021beyond).
Figure 5: Ablation results on the RGB-Stacking environment for 3 different network architectures.
...and 2 more figures
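
The adapter locations mentioned in the captions for Figures 1 and 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: the backbone is a hypothetical list of three fixed stages, the adapter is a simplified residual function, and the mapping of "bottom", "middle", and "top" to "after the first stage", "after intermediate stages", and "after the last stage" is an assumption made for this example.

```python
def make_stage(scale):
    """Hypothetical frozen backbone stage: a fixed elementwise map."""
    return lambda x: [scale * a for a in x]


def make_adapter(gain):
    """Toy residual adapter: x + gain * relu(x). A gain of 0 is the identity."""
    return lambda x: [a + gain * max(0.0, a) for a in x]


# Frozen backbone: these stages are never modified.
BACKBONE = [make_stage(0.5), make_stage(2.0), make_stage(-1.0)]


def position_of(index, num_stages):
    """Assumed naming: first stage is 'bottom', last is 'top', rest 'middle'."""
    if index == 0:
        return "bottom"
    if index == num_stages - 1:
        return "top"
    return "middle"


def forward(x, adapters=None):
    """Run the frozen stages, applying an adapter after a stage if one is given."""
    adapters = adapters or {}
    for i, stage in enumerate(BACKBONE):
        x = stage(x)
        adapter = adapters.get(position_of(i, len(BACKBONE)))
        if adapter is not None:
            x = adapter(x)
    return x


x = [1.0, -2.0]
print(forward(x))                                   # original features
print(forward(x, {"top": make_adapter(1.0)}))       # top adapter only
print(forward(x, {"bottom": make_adapter(1.0),
                  "middle": make_adapter(1.0),
                  "top": make_adapter(1.0)}))       # all three positions
```

Passing an empty adapter set reproduces the original features exactly, while adding bottom and middle adapters lets the adaptation influence earlier stages rather than only the final output.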
