Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control

Dongyoon Hwang; Byungkun Lee; Hojoon Lee; Hyunseung Kim; Jaegul Choo

TL;DR

This paper addresses the adaptation of pretrained Vision Transformers (ViTs) to visuo-motor control by introducing CoIn, a lightweight Convolution Injector that adds locality and translation-equivariance biases through a compact CNN encoder and deformable cross-attention. Because the ViT architecture is left intact and only CoIn is added, the method retains the strong pretrained representations while supplying control-centric inductive biases, enabling end-to-end finetuning with modest computational overhead. Across 12 tasks in Adroit, MetaWorld, and DMC and three pretrained ViTs (CLIP, MVP, VC-1), CoIn yields consistent performance gains, including an 11.3-point mean improvement with CLIP and further gains with MVP and VC-1. The work suggests that integrating convolutional priors into pretrained ViTs is a practical way to improve visuo-motor control, with reinforcement learning and real-robot deployment as promising directions.
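The injection idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses plain single-head cross-attention in place of the deformable cross-attention named above, and the function name `inject_conv_features`, the scalar `gate`, and all shapes are assumptions chosen for illustration. ViT patch tokens act as queries, flattened CNN features act as keys and values, and the result is added back to the tokens as a residual.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_conv_features(patch_tokens, conv_features, w_q, w_k, w_v, gate):
    """Hypothetical helper: add CNN-derived features to ViT patch tokens.

    patch_tokens:  (N, D) ViT patch token embeddings, used as queries
    conv_features: (M, D) flattened CNN feature map, used as keys and values
    gate:          scalar scaling the injected update (an assumption of this sketch)
    """
    q = patch_tokens @ w_q
    k = conv_features @ w_k
    v = conv_features @ w_v
    # Plain scaled dot-product attention; the paper names deformable cross-attention instead.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    # Residual update keeps the token shape, so the ViT blocks themselves are untouched.
    return patch_tokens + gate * (attn @ v)

rng = np.random.default_rng(0)
N, M, D = 196, 49, 64  # illustrative sizes, not taken from the paper
tokens = rng.standard_normal((N, D))
feats = rng.standard_normal((M, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

unchanged = inject_conv_features(tokens, feats, w_q, w_k, w_v, gate=0.0)
injected = inject_conv_features(tokens, feats, w_q, w_k, w_v, gate=1.0)
print(unchanged.shape, injected.shape)
```

With `gate=0.0` the sketch returns the original tokens, which shows one way an add-on module could start from the pretrained representation and learn how much convolutional signal to mix in. Whether CoIn uses such a gate is not stated in this summary.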

Abstract

Vision Transformers (ViTs), when paired with large-scale pretraining, have shown remarkable performance across various computer vision tasks, primarily due to their weak inductive bias. However, while this weak inductive bias aids pretraining scalability, it may hinder the effective adaptation of ViTs to visuo-motor control tasks because control-centric inductive biases are absent. These absent biases include spatial locality and translation equivariance, which convolutions naturally offer. To this end, we introduce Convolution Injector (CoIn), an add-on module that injects convolutions, which are rich in locality and equivariance biases, into a pretrained ViT for effective adaptation in visuo-motor control. We evaluate CoIn with three distinct types of pretrained ViTs (CLIP, MVP, VC-1) across 12 varied control tasks within three separate domains (Adroit, MetaWorld, DMC), and demonstrate that CoIn consistently enhances control task performance across all evaluated environments and models, validating the effectiveness of providing pretrained ViTs with control-centric biases.
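The translation-equivariance property mentioned above can be checked on a toy 1-D signal. The sketch below is an illustration under simplifying assumptions (circular padding, a hand-picked kernel, and a 1-D stand-in for ViT patch embedding); the helper names `circular_conv1d` and `patch_embed` are hypothetical and do not come from the paper. It compares "shift then apply" with "apply then shift" for a convolution and for non-overlapping patch projection.

```python
import numpy as np

def circular_conv1d(x, kernel):
    # Cross-correlation with wrap-around padding:
    # y[i] = sum_j kernel[j] * x[(i + j) % n]
    n = len(x)
    return np.array([
        sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

def patch_embed(x, patch_size, w):
    # Split into non-overlapping patches and project each with the same weights,
    # a 1-D analogue of ViT patchification.
    return x.reshape(-1, patch_size) @ w

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -2.0, 1.0])
w = np.array([1.0, 2.0, 3.0, 4.0])
shifted = np.roll(x, 1)  # translate the input by one position

# Convolution: shifting the input shifts the output by the same amount.
conv_equivariant = np.allclose(
    circular_conv1d(shifted, kernel),
    np.roll(circular_conv1d(x, kernel), 1),
)

# Patch embedding: a one-position shift is not any shift of the original embedding.
emb = patch_embed(x, 4, w)
emb_shifted = patch_embed(shifted, 4, w)
patch_equivariant = any(
    np.allclose(emb_shifted, np.roll(emb, s)) for s in range(len(emb))
)

print(conv_equivariant, patch_equivariant)
```

For this input the convolution check holds and the patch-embedding check fails, since patch projection is only equivariant to shifts that are multiples of the patch size.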

Paper Structure (31 sections, 4 equations, 6 figures, 13 tables)

Introduction
Related Work
Pretrained Visual Encoders for Control
Integration of CNNs with Pretrained ViTs in Computer Vision
Method
Vision Transformer
CNN Encoder
Cross Attention Module
Implementation and Computation Requirements
Experiment Setup
Environments
Models
Downstream Evaluation
Experiments
Main Results
...and 16 more sections

Figures (6)

Figure 1: Avg. performance across 12 visuo-motor control tasks. Our model CoIn introduces convolutional inductive biases into ViTs, resulting in consistent performance improvements for various pretrained ViTs.
Figure 2: Overall framework. (Stage 1) The advent of open-sourced, large-scale ViTs pretrained with extensive web-scale datasets provides generalized, ready-to-go visual representations. (Stage 2) To adapt these pretrained ViTs for visuo-motor control, we finetune them with an additional light-weight module, CoIn, enhancing the ViT's ability to extract visual features beneficial for control, such as spatial locality and translation equivariance.
Figure 3: Overall architecture of CoIn. While leaving the (a) ViT architecture untouched, (b) CoIn incorporates two key modules: a CNN encoder, which captures spatial locality and translation equivariance rich features from the input image, and a cross attention module, which introduces such biases into the ViT patch token embeddings. Notably, these enhancements are seamlessly integrated without any modification to the overall ViT architecture.
Figure 4: Visualization of tasks used in our evaluation. We utilize 2 tasks from Adroit, 5 tasks from MetaWorld, and 5 tasks from DMC.
Figure 5: (a) Comparison of the relative log amplitudes of Fourier-transformed feature maps. ViT + CoIn incorporates beneficial inductive biases extracted from convolutional networks, allowing it to capture more high-frequency signals than ViT alone. (b) Translation equivariance comparison. ViT + CoIn enhances translation equivariance across intermediate representations within the ViT. (c) Visualization of self-attention maps obtained through Attention Rollout. ViT + CoIn exhibits improved focus on critical regions for visuo-motor control. All analyses were performed on VC-1 and averaged across all 12 tasks.
...and 1 more figure
