Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models
Jiahuan Long, Tingsong Jiang, Wen Yao, Yizhe Xiong, Zhengqin Xu, Shuai Jia, Hanqing Liu, Chao Ma
TL;DR
The paper adapts large vision foundation models (VFMs) to downstream tasks without updating any parameters, by identifying and removing task-irrelevant feature redundancy. It introduces a channel-replacement strategy that finds redundant channels and swaps them for more informative ones using a search based on the difference in model output. To keep the search tractable, exploration is restricted with a top-N heuristic and a small search dataset, so adaptation requires inference only. The approach is formalized with replacement pairs and an objective that maximizes downstream mIoU. It yields consistent gains across SAM/SAM2 backbones on segmentation, depth estimation, and image classification, and it integrates with existing PEFT methods. In practice, the method reduces GPU memory overhead and broadens the applicability of VFMs to diverse downstream tasks without costly fine-tuning.
Abstract
Vision foundation models (VFMs) have demonstrated remarkable capabilities in learning universal visual representations. However, adapting these models to downstream tasks conventionally requires parameter updates, and even parameter-efficient fine-tuning methods modify thousands to millions of weights. In this paper, we investigate the redundancies in the Segment Anything Model (SAM) and propose a novel parameter-free fine-tuning method. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on fine-tuning foundation models. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse more task-relevant features for downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method across different vision tasks (e.g., image segmentation, depth estimation, and image classification). Notably, our approach integrates seamlessly with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces GPU memory overhead.
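The channel-replacement idea can be illustrated with a minimal toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the frozen "model" is a fixed linear head over three scalar channels, negative mean squared error stands in for mIoU, and the helper names (`replace_channels`, `score`, `search_pairs`) are hypothetical. The search scores every single (redundant, effective) pair by how much it changes the output quality on a small search set, keeps the top-N candidates, and then greedily accepts those that still improve the score. No weights are updated at any point.

```python
from itertools import permutations


def replace_channels(feats, pairs):
    """Copy `feats`, overwriting each redundant channel with an effective one."""
    out = list(feats)
    for redundant, effective in pairs:
        out[redundant] = feats[effective]
    return out


def score(samples, targets, weights, pairs):
    """Negative MSE of a frozen linear head (toy stand-in for mIoU)."""
    total = 0.0
    for feats, y in zip(samples, targets):
        f = replace_channels(feats, pairs)
        pred = sum(w * x for w, x in zip(weights, f))
        total += (pred - y) ** 2
    return -total / len(samples)


def search_pairs(samples, targets, weights, top_n=2):
    """Inference-only search for replacement pairs (assumed greedy variant)."""
    base = score(samples, targets, weights, [])

    # Rank every single replacement by its score gain on the search set.
    gains = []
    for pair in permutations(range(len(weights)), 2):
        gain = score(samples, targets, weights, [pair]) - base
        if gain > 0:
            gains.append((gain, pair))
    gains.sort(reverse=True)
    candidates = [pair for _, pair in gains[:top_n]]  # top-N restriction

    # Greedily combine candidates, keeping a pair only if the score improves.
    chosen, best = [], base
    for pair in candidates:
        if any(pair[0] == r for r, _ in chosen):
            continue  # each redundant channel is replaced at most once
        trial = score(samples, targets, weights, chosen + [pair])
        if trial > best:
            chosen, best = chosen + [pair], trial
    return chosen, best


# Toy search set: channel 0 is signal, channel 1 is noise, channel 2 is dead.
samples = [[1.0, 0.5, 0.0], [2.0, -1.0, 0.0], [3.0, 0.7, 0.0]]
targets = [2.0, 4.0, 6.0]      # the task wants twice the signal
weights = [1.0, 1.0, 1.0]      # frozen head, never modified

pairs, best = search_pairs(samples, targets, weights, top_n=2)
baseline = score(samples, targets, weights, [])
print(pairs, baseline, best)
```

In this toy setting the search replaces the noisy channel with the signal channel, which improves the score while the head weights stay fixed. In a real VFM the same replacement would act on feature-map channels inside the network, and the score would be a task metric computed on the small search dataset.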
