Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition

Yurong Zhang, Honghao Chen, Xinyu Zhang, Xiangxiang Chu, Li Song

TL;DR

Large pre-trained vision models incur high fine-tuning costs; PETL methods often fail to deliver practical inference efficiency and can entangle representations across layers. Dyn-Adapter freezes the backbone and adds multi-stage early exits with dynamic balanced heads, plus a bidirectional sparsification strategy and dynamic inference to disentangle features and enable input-dependent computation. It achieves up to 50% FLOPs reduction with maintained or improved accuracy across LoRA, Adapter, and Rep-Adapter baselines on VTAB-1k and video benchmarks, and generalizes across SSL objectives and few-shot settings. This approach provides a simple, versatile efficiency booster for PETL in visual recognition with broad practical impact.

Abstract

Parameter-efficient transfer learning (PETL) is a promising approach that adapts large-scale pre-trained models to downstream tasks at a relatively modest cost. However, current PETL methods struggle to reduce computational complexity and bear a heavy inference burden because they require a complete forward pass. This paper presents an efficient visual recognition paradigm, called Dynamic Adapter (Dyn-Adapter), that boosts PETL efficiency by subtly disentangling features at multiple levels. Our approach is simple: first, we devise a dynamic architecture with balanced early heads for multi-level feature extraction, along with an adaptive training strategy. Second, we introduce a bidirectional sparsity strategy driven by the pursuit of powerful generalization ability. These qualities enable us to fine-tune efficiently and effectively: we reduce FLOPs during inference by 50%, while maintaining or even improving recognition accuracy. Extensive experiments on diverse datasets and pretrained backbones demonstrate the potential of Dyn-Adapter to serve as a general efficiency booster for PETL in visual recognition tasks.
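
The input-dependent computation described above can be illustrated with a minimal early-exit sketch. This is not the authors' implementation: the confidence-threshold exit rule, the names `early_exit_predict`, `toy_stage`, and `toy_head`, and the threshold of 0.9 are all assumptions chosen for illustration. The idea is only that each stage has its own classification head, and inference stops at the first head that is confident enough.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: float,
    stages: Sequence[Callable[[float], float]],
    heads: Sequence[Callable[[float], Sequence[float]]],
    threshold: float = 0.9,  # hypothetical confidence threshold
) -> Tuple[int, int]:
    """Run stages in order and stop at the first confident head.

    Returns (predicted class index, number of stages executed).
    The last head always produces a prediction if no earlier one exits.
    """
    if not stages or len(stages) != len(heads):
        raise ValueError("need one head per stage and at least one stage")
    features = x
    executed = 0
    probs: List[float] = []
    for stage, head in zip(stages, heads):
        features = stage(features)       # frozen backbone block + adapter (abstracted)
        probs = softmax(head(features))  # early classification head
        executed += 1
        if max(probs) >= threshold:      # assumed exit rule: confident enough
            break
    return probs.index(max(probs)), executed


# Toy stand-ins (hypothetical): each stage shifts a scalar feature,
# each head maps the feature to two-class logits.
def toy_stage(feature: float) -> float:
    return feature + 1.0


def toy_head(feature: float) -> List[float]:
    return [feature, -feature]


easy = early_exit_predict(2.0, [toy_stage] * 3, [toy_head] * 3)
hard = early_exit_predict(-1.5, [toy_stage] * 3, [toy_head] * 3)
```

In this toy setup the "easy" input exits after the first stage, while the "hard" input needs all three, which is the source of the average FLOPs saving: only inputs that the early heads cannot classify confidently pay for the full forward pass.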

Paper Structure (14 sections, 10 equations, 4 figures, 7 tables)

This paper contains 14 sections, 10 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of Dyn-Adapter and baselines (top: paradigm difference, bottom: performance contrast). The throughput is measured on an NVIDIA 3090 GPU with a batch size of 1.
  • Figure 2: Overview of our Dyn-Adapter paradigm. Multiple early supervisions are introduced to facilitate dynamic inference. Adaptive learning and a bidirectional sparsification strategy effectively address Dyn-Adapter optimization. Dashed lines indicate that the connection between neuron nodes is dropped during the forward or backward pass.
  • Figure 3: Dynamic inference process.
  • Figure 4: CKA of corresponding features and labels. Perc. is an abbreviation for Dyn-Perceiver.