Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

Jiahuan Long, Tingsong Jiang, Wen Yao, Yizhe Xiong, Zhengqin Xu, Shuai Jia, Hanqing Liu, Chao Ma

TL;DR

The paper addresses how to adapt large vision foundation models (VFMs) to downstream tasks without updating any parameters, by identifying redundant, task-irrelevant feature channels and replacing them. It introduces a channel-replacement strategy that swaps redundant channels for more informative ones, using a search guided by the difference in model output. The search is kept tractable by a top-N heuristic and a small search dataset, so adaptation needs only model inference. The approach is formalized in terms of replacement pairs and an objective that maximizes downstream mIoU. It yields consistent gains across SAM/SAM2 backbones on segmentation, depth estimation, and image classification, and it integrates with existing PEFT methods. In practice, the method reduces GPU memory overhead and extends VFMs to diverse downstream tasks without costly fine-tuning.

Abstract

Vision foundation models (VFMs) have demonstrated remarkable capabilities in learning universal visual representations. However, adapting these models to downstream tasks conventionally requires parameter updates, with even parameter-efficient fine-tuning methods necessitating the modification of thousands to millions of weights. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a novel parameter-free fine-tuning method. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on fine-tuning foundation models. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse features that are more relevant to the downstream task, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method in different vision tasks (e.g., image segmentation, depth estimation, and image classification). Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces GPU memory overhead.
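The channel replacement described in the abstract can be illustrated with a minimal sketch. It assumes the encoder output is a (C, H, W) feature map stored as nested Python lists, and that a replacement pair is written as (redundant_idx, effective_idx). The function name `replace_channels` and the pair layout are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Channel = List[List[float]]   # one H x W feature map
Features = List[Channel]      # C channels, i.e. shape (C, H, W)
Pair = Tuple[int, int]        # (redundant_idx, effective_idx), assumed layout


def replace_channels(features: Features, pairs: Sequence[Pair]) -> Features:
    """Return a copy of `features` in which each redundant channel is
    overwritten by a copy of its paired effective channel."""
    out = [[row[:] for row in ch] for ch in features]
    for redundant, effective in pairs:
        # Read from the original features so that pairs do not chain.
        out[redundant] = [row[:] for row in features[effective]]
    return out


# Toy feature map with 3 channels of size 1 x 2.
feats = [
    [[0.0, 0.1]],   # channel 0: weak response (treated as redundant here)
    [[5.0, 6.0]],   # channel 1: strong response (treated as effective here)
    [[0.2, 0.0]],   # channel 2: left untouched
]
adapted = replace_channels(feats, [(0, 1)])
print(adapted[0])   # [[5.0, 6.0]]
```

No weights are touched: the only change is which pretrained feature occupies each channel slot before the features reach the decoder.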

Paper Structure

This paper contains 11 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A comparison between our method and other fine-tuning methods. (a) shows a fine-tuning method that updates the decoder to align pretrained features with the downstream task; (b) illustrates a fine-tuning method that updates the encoder to modify pretrained features for downstream adaptation. In contrast, (c) depicts our method, which adapts pretrained features to the downstream task by replacing specific redundant channels without any parameter updates.
  • Figure 2: Pipeline of the search process for the optimal combination of replacement pairs. First, a subset of the training dataset, referred to as the "search dataset," is fed into SAM’s encoder to obtain pretrained features. Next, in the channel replacement module, each source channel is replaced in turn with a target channel, and the modified features are sent to the decoder. By comparing the decoded results from the original and modified features, we construct a dictionary $\mathcal{D}$ that records each replacement pair and its corresponding output difference. After sorting $\mathcal{D}$ by output difference and keeping the top $N$ pairs as the subset $\mathcal{D}_{topN}$, we search within $\mathcal{D}_{topN}$ for the optimal replacement pair combination $P^{*}$. Applying $P^{*}$ replaces the redundant channels with more effective ones, thereby improving performance on downstream tasks.
  • Figure 3: Qualitative comparison of various fine-tuning methods for SAM across natural, medical, and camouflage scenarios. Columns from left to right show the original input image (Input), the ground-truth segmentation mask (GT), the segmentation results from the original SAM (Base), and the results before and after applying our method to models fine-tuned with other methods (i.e., Adapter, DoRA). The results demonstrate that our method improves already fine-tuned models, producing more refined predictions that are closer to the ground truth, as highlighted by the red boxes. Refer to the Appendix for more visualizations.
  • Figure 4: Performance comparison for varying numbers of replacement pairs. The results suggest that combining different replacement pairs can further improve segmentation performance.
  • Figure 5: Comparison of effective and redundant channels in optimal replacement pairs for downstream tasks. "Effective channels" replace "redundant channels" to improve task-specific performance. Colors in the feature maps range from green (weak response) to yellow (strong response). (a)-(c) show natural, medical, and camouflage scenarios, respectively; in each, effective-channel features have more discernible structures, edges, and textures than redundant-channel features.
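The search pipeline in the Figure 2 caption can be sketched in a few lines. This is only a toy illustration under stated assumptions:

- The "output difference" is taken to be the change in average IoU on the search dataset.
- The combination step is a simple greedy pass over the top-N pairs, since the caption does not specify how the combination is explored.
- The decoder is a stand-in that thresholds the channel mean.

All names (`search_pairs`, `toy_decoder`, `mean_iou`) are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Features = List[List[float]]   # C channels, each flattened to a vector
Mask = List[int]               # binary mask, flattened
Pair = Tuple[int, int]         # (redundant_idx, effective_idx), assumed layout
Decoder = Callable[[Features], Mask]
Dataset = Sequence[Tuple[Features, Mask]]


def replace_channels(features: Features, pairs: Sequence[Pair]) -> Features:
    """Copy `features`, overwriting each redundant channel with its pair."""
    out = [ch[:] for ch in features]
    for redundant, effective in pairs:
        out[redundant] = features[effective][:]
    return out


def iou(pred: Mask, gt: Mask) -> float:
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def mean_iou(decoder: Decoder, data: Dataset, pairs: Sequence[Pair]) -> float:
    """Average IoU over the search dataset (simplified stand-in for mIoU)."""
    scores = [iou(decoder(replace_channels(f, pairs)), gt) for f, gt in data]
    return sum(scores) / len(scores)


def search_pairs(decoder: Decoder, data: Dataset, top_n: int):
    num_channels = len(data[0][0])
    base = mean_iou(decoder, data, [])
    # Dictionary D: every single replacement pair -> its output difference.
    diffs: Dict[Pair, float] = {
        pair: mean_iou(decoder, data, [pair]) - base
        for pair in permutations(range(num_channels), 2)
    }
    # D_topN: the N pairs with the largest improvement.
    top = sorted(diffs, key=diffs.get, reverse=True)[:top_n]
    # Greedy combination (an assumption): keep a pair only if it helps.
    best: List[Pair] = []
    best_score = base
    for pair in top:
        if any(pair[0] == chosen[0] for chosen in best):
            continue  # this channel has already been replaced
        trial = mean_iou(decoder, data, best + [pair])
        if trial > best_score:
            best, best_score = best + [pair], trial
    return best, best_score


def toy_decoder(features: Features) -> Mask:
    """Stand-in decoder: threshold the per-position mean over channels."""
    n = len(features)
    return [int(sum(col) / n > 0.5) for col in zip(*features)]


# One search sample: channel 0 matches the mask, channel 1 opposes it.
search_data = [(
    [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [0.6, 0.6, 0.6, 0.6]],
    [1, 1, 0, 0],
)]
best_pairs, best_score = search_pairs(toy_decoder, search_data, top_n=3)
print(best_pairs, best_score)
```

Every step calls only the decoder's forward pass, which matches the caption's inference-only description. Scoring all ordered pairs grows quadratically with the channel count, which is presumably why the combination search is restricted to the top-N subset.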