Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing

Wei Dong; Dawei Yan; Zhijun Lin; Peng Wang

TL;DR

This paper tackles the problem of efficiently adapting large Vision Transformer models to downstream tasks without full fine-tuning. It introduces Adapter Re-Composing (ARC), a linear, parameter-sharing approach that reuses a shared low-rank projection basis across layers and composes layer-specific adapters via diagonal re-scaling. The key contributions are the symmetric down-/up-projections, per-layer scaling, and a largely parameter-free inference path, enabling strong transfer performance with far fewer learnable parameters across ViT and Swin backbones. Empirical results on 24 downstream datasets show ARC achieving competitive accuracy while substantially reducing adaptation cost, highlighting its practical impact for scalable pre-trained-model deployment.

Abstract

The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to significantly reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at https://github.com/DavidYanAnDe/ARC.
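
The parameter-sharing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a residual linear bottleneck in which the up-projection is the transpose of the shared down-projection and each layer owns only a vector of re-scaling coefficients. The function names (`make_shared_basis`, `arc_adapter`, `merge_into_weight`, `count_params`), the dimensions, the omission of biases, and the placement of the adapter in front of a linear layer are all assumptions made for illustration.

```python
import numpy as np


def make_shared_basis(d_model, rank, rng):
    """Hypothetical shared down-projection of shape (d_model, rank).

    The up-projection is taken to be its transpose (the "symmetric" part),
    so only one matrix is stored for all layers.
    """
    return rng.standard_normal((d_model, rank)) / np.sqrt(d_model)


def arc_adapter(x, w_down, scale):
    """Residual linear bottleneck: x + x @ W_down @ diag(scale) @ W_down.T.

    `scale` is the layer-specific vector of re-scaling coefficients.
    """
    return x + ((x @ w_down) * scale) @ w_down.T


def merge_into_weight(w0, w_down, scale):
    """Fold the (purely linear) adapter into a following linear layer w0.

    Assumes the adapter output is fed to x @ w0, so that
    arc_adapter(x) @ w0 == x @ merged.
    """
    d_model = w_down.shape[0]
    return (np.eye(d_model) + (w_down * scale) @ w_down.T) @ w0


def count_params(d_model, rank, n_layers, shared=True):
    """Count adapter weights (biases ignored) under two designs."""
    if shared:
        # one shared basis plus one coefficient vector per layer
        return d_model * rank + n_layers * rank
    # independent down- and up-projection in every layer
    return n_layers * 2 * d_model * rank


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_down = make_shared_basis(d_model=16, rank=4, rng=rng)
    x = rng.standard_normal((3, 16))
    # two layers reuse w_down but learn different coefficients
    for layer_scale in (rng.standard_normal(4), rng.standard_normal(4)):
        print(arc_adapter(x, w_down, layer_scale).shape)
    # illustrative sizes only, not figures reported in the paper
    print(count_params(768, 50, 12, shared=True))
    print(count_params(768, 50, 12, shared=False))
```

Because every operation in this sketch is linear, the adapter can be absorbed into the adjacent weight matrix after training, which is one way to read the claim that adaptation adds little cost at inference time. The parameter counts show why sharing helps: the per-layer cost drops from two full projection matrices to a single vector of length `rank`.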

Paper Structure (36 sections, 10 equations, 6 figures, 18 tables)

Introduction
Related work
Pre-training and fine-tuning.
Parameter-efficient transfer learning.
Approach
Preliminary
Adapter Re-Composing method
Architecture.
Inference.
Insights of architecture design
Experiments
Experimental settings
Datasets.
Pre-trained backbone.
Baselines and existing methods.
...and 21 more sections

Figures (6)

Figure 1: Visual summary of typical parameter-efficient pre-trained model adaptation methods.
Figure 2: Illustration of the proposed Adapter Re-Composing Method.
Figure 3: Singular value distribution of adaptation matrices without the bottleneck structure. Two adaptation matrices of both MHA and FFN blocks are fine-tuned on the DTD downstream task. The X-axis represents the singular values, while the Y-axis represents the count of singular values within specific ranges. Complete visualization is available in the appendix.
Figure 4: Singular value distribution of adaptation matrices without the bottleneck structure. Two adaptation matrices of both MHA and FFN blocks are fine-tuned on the DTD downstream task. The X-axis represents the singular values, while the Y-axis represents the count of singular values within specific ranges.
Figure 5: Parameter size comparison of lightweight adaptation methods on ViT backbones of different scales. The X-axis represents different adaptation methods, while the Y-axis represents the parameter size in millions (M).
...and 1 more figure
