Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
Yurong Zhang, Honghao Chen, Xinyu Zhang, Xiangxiang Chu, Li Song
TL;DR
Large pre-trained vision models incur high fine-tuning costs; parameter-efficient transfer learning (PETL) methods often fail to deliver practical inference efficiency and can entangle representations across layers. Dyn-Adapter freezes the backbone and adds multi-stage early exits with dynamic balanced heads, plus a bidirectional sparsification strategy and dynamic inference, to disentangle features and enable input-dependent computation. It achieves up to 50% FLOPs reduction with maintained or improved accuracy across LoRA, Adapter, and Rep-Adapter baselines on VTAB-1k and video benchmarks, and generalizes across SSL objectives and few-shot settings. The approach provides a simple, versatile efficiency booster for PETL in visual recognition.
Abstract
Parameter-efficient transfer learning (PETL) is a promising approach that aims to adapt large-scale pre-trained models to downstream tasks at relatively modest cost. However, current PETL methods struggle to reduce computational complexity and bear a heavy inference burden because inference still requires a complete forward pass. This paper presents an efficient visual recognition paradigm, called Dynamic Adapter (Dyn-Adapter), that boosts PETL efficiency by subtly disentangling features at multiple levels. Our approach is simple: first, we devise a dynamic architecture with balanced early heads for multi-level feature extraction, along with an adaptive training strategy. Second, we introduce a bidirectional sparsity strategy, motivated by the pursuit of strong generalization ability. These qualities enable us to fine-tune efficiently and effectively: we reduce FLOPs during inference by 50% while maintaining or even improving recognition accuracy. Extensive experiments on diverse datasets and pre-trained backbones demonstrate the potential of Dyn-Adapter to serve as a general efficiency booster for PETL in visual recognition tasks.
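To make the idea of early heads and input-dependent computation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an input exits at the first head whose maximum softmax probability reaches a cutoff; the abstract does not state the exit criterion, so this rule, the names `early_exit_predict`, `stages`, `heads`, and `threshold`, and the toy scalar "features" are all hypothetical stand-ins.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, heads, threshold=0.9):
    """Run stages in order and stop at the first confident head.

    stages:    callables feature -> feature (stand-ins for frozen backbone stages)
    heads:     callables feature -> logits, one early head per stage
    threshold: hypothetical confidence cutoff (an assumption, not from the paper)

    Returns (predicted_class, index_of_exit_stage).
    """
    feature = x
    last = len(stages) - 1
    for i, (stage, head) in enumerate(zip(stages, heads)):
        feature = stage(feature)
        probs = softmax(head(feature))
        confidence = max(probs)
        # Exit early when confident; the final head always answers.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i


# Toy setup: each stage doubles a scalar feature, each head emits two logits.
stages = [lambda f: 2.0 * f] * 3
heads = [lambda f: [f, -f]] * 3

easy = early_exit_predict(1.0, stages, heads)   # confident at the first head
hard = early_exit_predict(0.1, stages, heads)   # needs every stage
```

In this toy setup the "easy" input leaves after one stage while the "hard" input runs all three, which is the mechanism by which average inference FLOPs can fall below the cost of a full forward pass.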
