AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Shoufa Chen; Chongjian Ge; Zhan Tong; Jiangliu Wang; Yibing Song; Jue Wang; Ping Luo

TL;DR

AdaptFormer introduces AdaptMLP, a lightweight module that replaces the MLP block in Vision Transformer encoders: the original MLP is kept frozen and a trainable bottleneck branch runs in parallel with it, enabling efficient, task-specific fine-tuning without updating the backbone. With fewer than 2% of parameters tunable, AdaptFormer scales to both image and video domains and matches or surpasses full fine-tuning and other parameter-efficient methods on benchmarks including Something-Something V2, HMDB51, and NUS-WIDE. Its accuracy remains stable as the number of tunable parameters grows, it works across ViT variants (and Swin), and it supports cross-domain transfer, such as using image-pretrained backbones for video tasks. Overall, AdaptFormer offers a practical, scalable path toward a universal Vision Transformer for visual recognition with low computational and memory overhead.
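The parallel-branch idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ReLU between the two projections, the scalar `scale`, the zero-initialized up-projection, the parameter-free layer norm, and all function and variable names are assumptions made for illustration, and the frozen MLP uses ReLU instead of GELU for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Parameter-free layer norm over the feature axis (learned affine omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def frozen_mlp(x, w1, b1, w2, b2):
    """Stand-in for the pre-trained MLP block; its weights are never updated."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU for brevity (assumption)
    return hidden @ w2 + b2

def adapt_mlp(x, mlp_params, w_down, b_down, w_up, b_up, scale=0.1):
    """Frozen MLP branch plus a trainable down->up bottleneck, added to the residual."""
    normed = layer_norm(x)
    frozen = frozen_mlp(normed, *mlp_params)                 # frozen branch
    bottleneck = np.maximum(normed @ w_down + b_down, 0.0)   # project down
    adapted = bottleneck @ w_up + b_up                       # project back up
    return x + frozen + scale * adapted

# Toy dimensions (hypothetical): 4 tokens, width 8, bottleneck width 2.
rng = np.random.default_rng(0)
dim, hidden_dim, bottleneck_dim = 8, 32, 2
tokens = rng.normal(size=(4, dim))
mlp_params = (
    rng.normal(size=(dim, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, dim)), np.zeros(dim),
)
w_down = rng.normal(size=(dim, bottleneck_dim))
b_down = np.zeros(bottleneck_dim)
w_up = np.zeros((bottleneck_dim, dim))  # zero init: the new branch starts as a no-op
b_up = np.zeros(dim)

out = adapt_mlp(tokens, mlp_params, w_down, b_down, w_up, b_up)
baseline = tokens + frozen_mlp(layer_norm(tokens), *mlp_params)
```

With the up-projection initialized to zero, `out` equals `baseline`, so under these assumptions the adapted block starts from the behavior of the frozen pre-trained block and only the bottleneck weights would need gradients during fine-tuning.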

Abstract

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A natural next step is to adapt a ViT to various image and video recognition tasks. This adaptation is challenging because of its heavy computation and memory cost: each model needs an independent and complete fine-tuning process for each task, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can adapt pre-trained ViTs to many different image and video tasks efficiently. It offers several advantages over prior art. Firstly, AdaptFormer introduces lightweight modules that add less than 2% extra parameters to a ViT, yet it increases the ViT's transferability without updating its original pre-trained parameters, significantly outperforming fully fine-tuned models (100% of parameters updated) on action recognition benchmarks. Secondly, it is plug-and-play across different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement over the fully fine-tuned models on Something-Something v2 and HMDB51, respectively. Code is available at https://github.com/ShoufaChen/AdaptFormer.
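To make the "less than 2% extra parameters" figure concrete, the following back-of-the-envelope sketch counts the weights of one down/up bottleneck per encoder layer. The ViT-Base width (768), depth (12), approximate backbone size (about 86M parameters), and the candidate bottleneck widths are assumptions used for illustration. The count ignores any scaling factor and the task-specific classification head, so it will not exactly reproduce the numbers reported in the paper.

```python
def bottleneck_params(dim: int, bottleneck_dim: int) -> int:
    """Weights and biases of one down-projection plus one up-projection."""
    down = dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * dim + dim
    return down + up

# Assumed ViT-Base-like configuration (not taken from this summary).
DIM = 768
NUM_LAYERS = 12
BACKBONE_PARAMS = 86_000_000  # approximate size of the frozen backbone

def added_fraction(bottleneck_dim: int) -> float:
    """Extra trainable parameters relative to the frozen backbone."""
    total = NUM_LAYERS * bottleneck_params(DIM, bottleneck_dim)
    return total / BACKBONE_PARAMS

for width in (1, 4, 16, 64):  # hypothetical bottleneck widths
    total = NUM_LAYERS * bottleneck_params(DIM, width)
    print(f"bottleneck={width:>2}  extra params={total:>9,}  "
          f"fraction={added_fraction(width):.3%}")
```

Under these assumptions, even a bottleneck width of 64 stays below the 2% mark, which is consistent with the claim in the abstract.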

Paper Structure (30 sections, 4 equations, 8 figures, 12 tables, 1 algorithm)

Introduction
Related Works
Transformer in Vision
Efficient Transfer learning for Transformers
Approach
Preliminary and Notation
AdaptFormer
Discussion
Experiments
Experimental Settings
Main Properties and Analysis
Scaling Tunable Parameters Up
Multi-Label Classification
Ablation Studies
Towards Visual Recognition Generalist Agent
...and 15 more sections

Figures (8)

Figure 1: Parameter-accuracy trade-off. We use ViT-Base as the backbone and report top-1 accuracy on the SSv2 dataset. AdaptFormer can surpass full fine-tuning with only 0.2% tunable parameters. More detailed results are shown in Table \ref{tab:ssl_pretrain}.
Figure 2: Comparison of conventional full fine-tuning and our AdaptFormer fine-tuning. AdaptFormer is conceptually simple: it replaces the original MLP block with AdaptMLP, which consists of two branches, the frozen branch (left) and the trainable down$\rightarrow$up bottleneck module (right).
Figure 3: Prompt tuning illustration.
Figure 4: Performance trend as the number of tunable parameters grows. The accuracy of VPT drops dramatically when the number of parameters exceeds a task-specific value, while AdaptFormer is robust to the increase.
Figure 5: Test accuracy of VPT [jia-2022-vpt] with different numbers of introduced tokens. The optimization procedure becomes unstable when the number of tokens is equal to or larger than eight on the HMDB51 dataset [kuehne-iccv11-hmdb].
...and 3 more figures
