SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou
TL;DR
This work tackles parameter-efficient fine-tuning of vision transformers by exploiting task-specific information extracted from downstream images. It introduces Salient Channel Tuning (SCT) and a Class-Aware Importance Score (CAIS) to identify a small, task-relevant subset of channels (about $12.5\%$ of channels, i.e., $K=\frac{1}{8}$) and tunes them via a lightweight SCTM, keeping the backbone weights frozen. SCT achieves strong transfer across the 19 VTAB-1K datasets with only $\approx 0.11$M trainable parameters (roughly $780\times$ fewer than full fine-tuning) and outperforms full fine-tuning on 18 of the 19 tasks; it also shows robust domain generalization and few-shot performance, including with hierarchical backbones such as Swin-B. The method is simple to implement, does not rely on prompts or external adapters, and offers practical efficiency gains for resource-constrained deployment while maintaining competitive or superior accuracy. Key insights include the layer-wise saliency of channels, the effectiveness of inserting SCTM after attention or MLP blocks, and the value of class-aware channel selection for balanced, task-specific adaptation.
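The selection-then-tuning idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's exact CAIS or SCTM: the scoring rule (per-class mean squared activation, averaged over classes), the function names `select_salient_channels` and `tune_salient_channels`, and the residual linear update are all assumptions made for illustration.

```python
import numpy as np


def select_salient_channels(features, labels, keep_ratio=1 / 8):
    """Return sorted indices of the channels with the highest class-balanced score.

    features: (num_samples, num_channels) float activations from one layer,
              e.g. token-averaged outputs of a frozen backbone block.
    labels:   (num_samples,) integer class labels.
    NOTE: the scoring rule below is an illustrative stand-in, not the paper's CAIS.
    """
    classes = np.unique(labels)
    # Per-class mean squared activation, shape (num_classes, num_channels).
    per_class = np.stack(
        [(features[labels == c] ** 2).mean(axis=0) for c in classes]
    )
    # Average over classes so that frequent classes do not dominate the score.
    score = per_class.mean(axis=0)
    k = max(1, int(features.shape[1] * keep_ratio))
    top = np.argsort(score)[-k:]  # indices of the k largest scores
    return np.sort(top)


def tune_salient_channels(x, idx, weight, bias):
    """Add a small linear update to the selected channels only (hypothetical module).

    x: (..., num_channels) float array; weight: (k, k); bias: (k,).
    All other channels pass through unchanged.
    """
    out = x.copy()
    out[..., idx] += x[..., idx] @ weight + bias
    return out
```

In this sketch only `weight` and `bias` would be trained, so the trainable parameter count depends on the number of selected channels rather than on the full feature width. With zero-initialized `weight` and `bias`, the function returns its input unchanged.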
Abstract
Pre-trained vision transformers provide strong representations that benefit various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning on diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT), which leverages task-specific information by forwarding the task images through the model to select a subset of channels in a feature map. This enables us to tune only 1/8 of the channels, leading to significantly lower parameter costs. Experiments on 19 visual transfer learning downstream tasks demonstrate that SCT outperforms full fine-tuning on 18 out of 19 tasks while adding only 0.11M parameters to ViT-B, which is 780$\times$ fewer than its full fine-tuning counterpart. Experiments on domain generalization and few-shot classification further demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/showlab/SCT.
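As a rough sanity check on the 0.11M figure, the following sketch counts trainable parameters under assumptions that the abstract does not state: a ViT-B with 12 blocks and hidden width 768, one tuning module per block, and each module being a single linear layer with bias acting only on the selected 1/8 of the channels (classification head ignored). The function name `sct_trainable_params` is hypothetical.

```python
# Assumed configuration (not given in the abstract): ViT-B with 12 blocks,
# hidden width 768, one linear-with-bias module per block on selected channels.
NUM_BLOCKS = 12
HIDDEN_DIM = 768
KEEP_RATIO = 1 / 8


def sct_trainable_params(num_blocks=NUM_BLOCKS, hidden_dim=HIDDEN_DIM,
                         keep_ratio=KEEP_RATIO):
    """Count parameters of one k x k linear layer plus bias per block."""
    k = int(hidden_dim * keep_ratio)  # 96 selected channels for ViT-B
    per_block = k * k + k             # weight matrix + bias vector
    return num_blocks * per_block


total = sct_trainable_params()
print(total)                # 111744, i.e. about 0.11M
print(780 * 0.11e6 / 1e6)   # implied size of the fully fine-tuned model, in millions
```

Under these assumptions the count is 111,744, which rounds to the reported 0.11M, and $780 \times 0.11$M $\approx 85.8$M is in line with the commonly quoted size of ViT-B (about 86M parameters).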
