Convolutional Bypasses Are Better Vision Transformer Adapters

Shibo Jie, Zhi-Hong Deng

TL;DR

This paper presents Convpass, a vision-oriented parameter-efficient adaptation module for Vision Transformers (ViT) that places convolutional bypass blocks in parallel with the MHSA and MLP layers. Its 3x3 convolutions add a lightweight spatial inductive bias that better captures image structure, yielding strong results on VTAB-1K and few-shot benchmarks while training less than 0.5% of the model's parameters. Experiments show that Convpass outperforms language-oriented PETL methods across natural, specialized, and structured tasks, and also generalizes well to CLIP-style domain shifts. The work argues that adaptation modules should be tailored to visual inductive biases, offering a simple and effective direction for vision-specific PETL design.

Abstract

The pretrain-then-finetune paradigm has been widely adopted in computer vision. However, as Vision Transformers (ViT) grow exponentially in size, full finetuning becomes prohibitive because of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) into a pretrained ViT and finetune only these modules while the pretrained weights stay frozen. However, these modules were originally proposed for finetuning language models and do not take into account prior knowledge specific to visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small number of trainable parameters (less than 0.5% of model parameters) to adapt the large ViT. Unlike other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and is thus more suitable for visual tasks, especially in the low-data regime. Experimental results on the VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the need to tailor vision-oriented adaptation modules for adapting vision models.
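To make the "less than 0.5% of model parameters" figure concrete, the short calculation below estimates the trainable-parameter budget of such bypasses. Every size in it is an assumption for illustration and is not stated in this summary: a ViT-B/16 backbone with roughly 86M parameters, embedding width 768, 12 blocks, two bypasses per block (one beside MHSA, one beside MLP), and a bottleneck of width 8 built as 1x1 down-projection, 3x3 convolution, 1x1 up-projection. The task-specific classification head is not counted.

```python
# Rough trainable-parameter budget for Convpass-style bypasses.
# All sizes are illustrative assumptions, not values taken from the summary.

def bypass_params(embed_dim: int, hidden_dim: int, kernel: int = 3) -> int:
    """Parameters of one bottleneck bypass: 1x1 down -> kxk conv -> 1x1 up."""
    down = embed_dim * hidden_dim + hidden_dim                       # weights + bias
    conv = hidden_dim * hidden_dim * kernel * kernel + hidden_dim    # weights + bias
    up = hidden_dim * embed_dim + embed_dim                          # weights + bias
    return down + conv + up


EMBED_DIM = 768                # assumed ViT-B embedding width
HIDDEN_DIM = 8                 # assumed bottleneck width
NUM_BLOCKS = 12                # assumed number of transformer blocks
BYPASSES_PER_BLOCK = 2         # one parallel to MHSA, one parallel to MLP
BACKBONE_PARAMS = 86_000_000   # approximate size of ViT-B/16

per_bypass = bypass_params(EMBED_DIM, HIDDEN_DIM)
total = per_bypass * NUM_BLOCKS * BYPASSES_PER_BLOCK
fraction = total / BACKBONE_PARAMS

print(f"per bypass: {per_bypass:,} parameters")
print(f"all bypasses: {total:,} parameters")
print(f"share of backbone: {fraction:.2%}")
```

Under these assumptions the bypasses add about 0.33M trainable parameters, which is consistent with the sub-0.5% budget quoted in the abstract; a wider bottleneck would raise the share roughly linearly.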

Paper Structure

This paper contains 37 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Average accuracy vs. number of trainable parameters on VTAB-1K benchmark. Our vision-oriented Convpass outperforms other language-oriented methods.
  • Figure 2: Illustration of the unraveled view of ViT equipped with Adapter. For simplicity, we show the unraveled view of a fragment of ViT (MHSA-MLP-MHSA) and the type of each path. Normalization layers are omitted.
  • Figure 3: Overview of the proposed method. We restore the spatial structure of the token sequence, and use trainable ResNet-style convolutional blocks as bypasses. The [cls] token is regarded as an individual image.
  • Figure 4: Group-wise average results on VTAB-1K. Convpass outperforms other baselines in all of the three groups.
  • Figure 5: Results of few-shot learning on five fine-grained visual recognition datasets. Convpass outperforms other baselines on average results.
  • ...and 1 more figures
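
The mechanism described in the Figure 3 caption (restore the spatial structure of the token sequence, run a trainable ResNet-style convolutional block as a bypass, and treat the [cls] token as an individual image) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ConvBypass`, the bottleneck width of 8, the GELU activations, and the 1x1 projections are hypothetical choices, and the normalization layers and any scaling of the bypass output are omitted.

```python
import torch
import torch.nn as nn


class ConvBypass(nn.Module):
    """Hypothetical bottleneck bypass: 1x1 down -> 3x3 conv -> 1x1 up."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 8):
        super().__init__()
        self.down = nn.Conv2d(embed_dim, hidden_dim, kernel_size=1)
        self.conv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.up = nn.Conv2d(hidden_dim, embed_dim, kernel_size=1)
        self.act = nn.GELU()  # activation choice is an assumption

    @staticmethod
    def _to_image(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # (B, h*w, C) -> (B, C, h, w): restore the 2D layout of the tokens.
        b, _, c = tokens.shape
        return tokens.transpose(1, 2).reshape(b, c, h, w)

    def _block(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.down(x))
        x = self.act(self.conv(x))
        return self.up(x)

    def forward(self, tokens: torch.Tensor, grid_size: tuple) -> torch.Tensor:
        # tokens: (B, 1 + h*w, C) with the [cls] token in position 0.
        h, w = grid_size
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        patch_out = self._block(self._to_image(patches, h, w))
        patch_out = patch_out.flatten(2).transpose(1, 2)   # back to (B, h*w, C)
        cls_out = self._block(self._to_image(cls_tok, 1, 1))  # [cls] as a 1x1 image
        cls_out = cls_out.flatten(2).transpose(1, 2)       # back to (B, 1, C)
        return torch.cat([cls_out, patch_out], dim=1)


# Usage: add the bypass in parallel with a frozen sub-layer (here a stand-in MLP).
frozen_mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
for p in frozen_mlp.parameters():
    p.requires_grad = False  # pretrained weights stay frozen

bypass = ConvBypass()
x = torch.randn(2, 1 + 14 * 14, 768)          # 14x14 patch grid plus [cls]
y = x + frozen_mlp(x) + bypass(x, (14, 14))   # residual + frozen path + bypass
trainable = sum(p.numel() for p in bypass.parameters())
print(y.shape, trainable)
```

Only the bypass parameters receive gradients here, so the stored per-task state is limited to these small convolutional blocks (plus a task head in practice), while the frozen path is shared across tasks.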