
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou

TL;DR

This work tackles parameter-efficient fine-tuning of vision transformers by exploiting task-specific information extracted from downstream images. It introduces Salient Channel Tuning (SCT) and a Class-Aware Importance Score (CAIS) to identify a small, task-relevant subset of channels (about $12.5\%$ of channels, i.e., $K=\frac{1}{8}$) and tunes them via a lightweight SCTM while keeping the backbone weights frozen. SCT achieves strong transfer across the 19 VTAB-1K datasets with only $\approx 0.11$M trainable parameters (roughly $780\times$ fewer than full fine-tuning) and outperforms full fine-tuning on 18 of the 19 tasks; it also demonstrates robust domain generalization and few-shot performance, including with hierarchical backbones such as Swin-B. The method is simple to implement, does not rely on prompts or external adapters, and offers practical efficiency gains for resource-constrained deployment while maintaining competitive or superior accuracy. Key insights include the layer-wise saliency of channels, the effectiveness of inserting SCTM after attention or MLP blocks, and the value of class-aware channel selection for balanced, task-specific adaptation.

Abstract

Pre-trained vision transformers provide strong representations that benefit various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning on diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT), which leverages task-specific information by forwarding the task images through the model to select a subset of channels in a feature map. This enables us to tune only 1/8 of the channels, leading to significantly lower parameter costs. Experiments on 19 visual transfer learning downstream tasks demonstrate that SCT outperforms full fine-tuning on 18 out of 19 tasks while adding only 0.11M parameters to ViT-B, which is 780$\times$ fewer than its full fine-tuning counterpart. Furthermore, experiments on domain generalization and few-shot classification demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/showlab/SCT.
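The parameter figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes that ViT-B has 12 transformer layers with 768 channels, that one tuning module is inserted per layer, that each module is a single square linear layer (weights plus bias) over the selected channels, and that the full backbone has roughly 85.8M parameters. All names are hypothetical.

```python
# Back-of-the-envelope check of the abstract's numbers.
# Every constant below is an assumption, not a value stated in the abstract.
TOTAL_CHANNELS = 768          # ViT-B hidden dimension (assumed)
NUM_LAYERS = 12               # ViT-B depth (assumed)
SELECT_RATIO = 1 / 8          # fraction of channels tuned, from the abstract
BACKBONE_PARAMS = 85.8e6      # approximate ViT-B parameter count (assumed)

# Number of salient channels tuned per layer.
salient_channels = int(TOTAL_CHANNELS * SELECT_RATIO)

# One square linear layer (weight matrix + bias vector) per transformer layer.
params_per_module = salient_channels * salient_channels + salient_channels
trainable_params = NUM_LAYERS * params_per_module

# How many times smaller the trainable set is than the full backbone.
reduction = BACKBONE_PARAMS / trainable_params

print(f"salient channels per layer: {salient_channels}")
print(f"trainable parameters: {trainable_params} (~{trainable_params / 1e6:.2f}M)")
print(f"reduction vs. full fine-tuning: ~{reduction:.0f}x")
```

Under these assumptions the trainable count rounds to 0.11M, and dividing the assumed backbone size by the rounded 0.11M gives the quoted factor of about 780.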

Paper Structure (17 sections, 2 equations, 14 figures, 13 tables, 1 algorithm)

Figures (14)

  • Figure 1: Comparison of parameter counts and top-1 accuracy on the VTAB-1K benchmark across different baselines. We tune only 96 of the 768 channels of ViT-B/16 and obtain the best results among the compared methods.
  • Figure 2: The architecture comparison between the Adapter and our SCT. "Downsample" and "Upsample" represent the channel downsampling and upsampling operations. $D$ represents the number of channel dimensions.
  • Figure 3: Visualizations of the extracted feature maps on the Caltech101 dataset at each transformer layer. The Y-axis represents class indices, and the X-axis represents channel indices (768 in total). We group all image features by class label and then calculate each channel's $L_2$ value. This yields a 768-dimensional vector for each class, and we find that some channels have higher activation values than others across all classes, as shown by the vertical lines in this figure.
  • Figure 4: The top-36 selected salient channel indices on the Caltech101 dataset at each transformer layer (e.g., Layer-1, Layer-2). Each layer selects different salient channels. As the layers go deeper, the salient channels become more concentrated. A darker color represents a larger activation value.
  • Figure 5: We select the salient channels from the same layer on four downstream tasks. The results suggest that channel bias exists in various tasks.
  • ...and 9 more figures
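
The per-class statistic described in the Figure 3 caption, followed by a top-K selection as in Figure 4, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the features are synthetic, and summing the per-class $L_2$ values across classes to rank channels is an assumption (the exact Class-Aware Importance Score is not given in this excerpt).

```python
import numpy as np

def class_channel_norms(features, labels, num_classes):
    """Group features by class and compute each channel's L2 norm per class.

    features: (N, D) array, one D-dimensional feature per image.
    labels:   (N,) integer class labels.
    Returns a (num_classes, D) array, i.e. one D-dimensional vector per class.
    """
    norms = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        class_feats = features[labels == c]
        # L2 norm over the images of this class, separately for every channel.
        norms[c] = np.linalg.norm(class_feats, axis=0)
    return norms

def select_salient_channels(features, labels, num_classes, k):
    """Return the sorted indices of the k highest-scoring channels."""
    norms = class_channel_norms(features, labels, num_classes)
    # ASSUMPTION: aggregate by summing over classes; the paper's exact
    # class-aware score may differ.
    scores = norms.sum(axis=0)
    top_k = np.argsort(scores)[::-1][:k]
    return np.sort(top_k)

# Synthetic demo: 200 "images", 768 channels, 10 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 768))
labels = rng.integers(0, 10, size=200)
planted = [5, 17, 300]
features[:, planted] *= 10.0  # make three channels strongly activated

selected = select_salient_channels(features, labels, num_classes=10, k=96)
print(len(selected), all(ch in selected for ch in planted))
```

In this toy setup the three amplified channels play the role of the vertical lines in Figure 3: they have large values for every class, so they rank among the selected channels regardless of how the classes are distributed.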