Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision
Minglei Li, Peng Ye, Yongqi Huang, Lin Zhang, Tao Chen, Tong He, Jiayuan Fan, Wanli Ouyang
TL;DR
Adapter-X tackles the efficiency-generalization trade-off in vision parameter-efficient fine-tuning (PEFT) by introducing a Sharing Mixture of Adapters (SMoA). In SMoA, each token is dynamically routed to experts drawn from a single adapter library that is shared across Transformer blocks. Adapter-X complements this shared core with block-specific designs, such as per-block normalization and a Prompt Generator that diversifies inputs. On 2D VTAB-1K and 3D ScanObjectNN/ModelNet40, Adapter-X can outperform full fine-tuning while training only a small fraction of the parameters (0.20% for 2D and 1.88% for 3D in the reported settings), a significant efficiency gain that does not sacrifice accuracy. The work highlights the importance of combining parameter sharing, dynamic allocation, and per-block customization to achieve robust cross-task generalization in PEFT frameworks.
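The TL;DR describes SMoA only at a high level. The pure-Python sketch below illustrates one plausible reading of it. It rests on several assumptions: bottleneck adapters serve as the experts, each block owns a softmax top-k router, and each block has its own LayerNorm applied before routing. The class and method names (`SharedMixtureOfAdapters`, `route`, `top_k`) and all sizes are hypothetical and are not taken from the authors' code. The Prompt Generator is omitted because the text gives no details about it.

```python
import math
import random


def linear(x, w, b):
    """Dense layer on a single vector: returns w @ x + b as a list."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def layer_norm(x, gamma, beta, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


class Adapter:
    """Bottleneck adapter (one 'expert'): down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down, self.b_down = rand_matrix(bottleneck, dim, rng), [0.0] * bottleneck
        self.w_up, self.b_up = rand_matrix(dim, bottleneck, rng), [0.0] * dim

    def __call__(self, x):
        h = [max(0.0, v) for v in linear(x, self.w_down, self.b_down)]
        return linear(h, self.w_up, self.b_up)


class SharedMixtureOfAdapters:
    """Hypothetical SMoA-style module: one expert library shared by all blocks,
    while router and normalization parameters are owned by each block."""

    def __init__(self, dim, bottleneck, num_experts, num_blocks, top_k=1, seed=0):
        rng = random.Random(seed)
        self.experts = [Adapter(dim, bottleneck, rng) for _ in range(num_experts)]
        self.routers = [rand_matrix(num_experts, dim, rng) for _ in range(num_blocks)]
        self.gammas = [[1.0] * dim for _ in range(num_blocks)]  # per-block LayerNorm scale
        self.betas = [[0.0] * dim for _ in range(num_blocks)]   # per-block LayerNorm shift
        self.top_k = top_k

    def route(self, x, block):
        """Token-level routing: pick top-k experts, renormalize their softmax weights."""
        probs = softmax(linear(x, self.routers[block], [0.0] * len(self.experts)))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        total = sum(probs[i] for i in top)
        return [(i, probs[i] / total) for i in top]

    def __call__(self, tokens, block):
        out = []
        for tok in tokens:  # each token is routed independently
            x = layer_norm(tok, self.gammas[block], self.betas[block])
            delta = [0.0] * len(tok)
            for i, weight in self.route(x, block):
                delta = [d + weight * y for d, y in zip(delta, self.experts[i](x))]
            out.append([t + d for t, d in zip(tok, delta)])  # residual connection
        return out


rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
smoa = SharedMixtureOfAdapters(dim=8, bottleneck=2, num_experts=4, num_blocks=3, top_k=2)
out = smoa(tokens, block=1)
print(len(out), len(out[0]))  # 5 8
```

Because `self.experts` is a single list indexed from every block, adding Transformer blocks grows only the per-block routers and normalization parameters, not the expert library. This matches the inter-block sharing idea while still letting each block, and each token, select experts differently.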
Abstract
Parameter-efficient fine-tuning (PEFT) has become increasingly important as foundation models continue to grow in both popularity and size. Adapters have been particularly well received due to their potential for parameter reduction and their adaptability across diverse tasks. However, balancing high efficiency with robust generalization across tasks remains a challenge for adapter-based methods. Analyzing existing methods, we find that: 1) parameter sharing is key to reducing redundancy; and 2) more tunable parameters, dynamic allocation, and block-specific design are key to improving performance. Unfortunately, no previous work considers all of these factors. Inspired by this insight, we introduce a novel framework named Adapter-X. First, we propose a Sharing Mixture of Adapters (SMoA) module that simultaneously achieves token-level dynamic allocation, increased tunable parameters, and inter-block sharing. Second, we introduce block-specific designs, such as a Prompt Generator (PG), to further enhance adaptation. Extensive experiments across 2D image and 3D point cloud modalities show that Adapter-X is the first method to outperform full fine-tuning in both modalities, a significant milestone, while using significantly fewer parameters: only 0.20% and 1.88% of the original trainable parameters for 2D and 3D classification tasks, respectively. Our code will be publicly available.
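The claim that parameter sharing reduces redundancy while leaving room for more tunable parameters can be checked with simple counting. The sketch below compares two hypothetical budgets. The first gives every Transformer block a private bottleneck adapter. The second shares one expert library across all blocks and adds a bias-free router and a LayerNorm (scale and shift) to each block. The sizes (hidden width 768, bottleneck 8, 12 blocks, 4 experts) are assumed ViT-B-like values chosen for illustration; they are not the paper's configuration, and the per-block layout is an assumption, not the authors' exact design.

```python
def adapter_params(dim: int, bottleneck: int) -> int:
    """Weights and biases of one down-project/up-project bottleneck adapter."""
    down = dim * bottleneck + bottleneck
    up = bottleneck * dim + dim
    return down + up


def per_block_adapters(dim: int, bottleneck: int, num_blocks: int) -> int:
    """Baseline: every Transformer block owns a private adapter."""
    return num_blocks * adapter_params(dim, bottleneck)


def shared_mixture(dim: int, bottleneck: int, num_blocks: int, num_experts: int) -> int:
    """Hypothetical shared-library budget: experts counted once, plus a
    bias-free router (dim x num_experts) and a LayerNorm (2 * dim) per block."""
    experts = num_experts * adapter_params(dim, bottleneck)
    routers = num_blocks * dim * num_experts
    norms = num_blocks * 2 * dim
    return experts + routers + norms


# Assumed sizes, for illustration only.
D, R, L, E = 768, 8, 12, 4
print(per_block_adapters(D, R, L))   # 156768
print(shared_mixture(D, R, L, E))    # 107552
print(shared_mixture(D, 16, L, E))   # 156736
```

Under these assumptions, sharing four experts across twelve blocks costs roughly 69% of the per-block baseline. The saved budget is enough to double each expert's bottleneck (8 to 16) while staying slightly below the baseline. This shows one way that inter-block sharing and more tunable parameters per module can coexist.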
