1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization
Dongshuo Yin, Xue Yang, Deng-Ping Fan, Shi-Min Hu
TL;DR
CoLin introduces a parameter-efficient visual adapter that uses a multi-branch low-rank complex projection to adapt vision foundation models with about 1% new parameters. The authors provide a matrix-theoretic analysis showing gradient-direction entanglement in low-rank synthesis and propose an orthogonal loss with SVD-based initialization to mitigate it, boosting convergence. Extensive experiments across object detection, segmentation, classification, and rotated object detection demonstrate that CoLin consistently outperforms full fine-tuning and delta-tuning baselines. The approach offers practical transfer efficiency for large-scale visual models, enabling deployment in diverse domains such as remote sensing and medical imaging.
Abstract
Deploying vision foundation models typically relies on efficient adaptation strategies, because conventional full fine-tuning is prohibitively expensive and inefficient. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages do not transfer directly to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that adds new parameters amounting to only about 1% of the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and we address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (a remote sensing scenario) demonstrate that CoLin is the first method to outperform both full fine-tuning and classical delta-tuning approaches while adding only about 1% new parameters, providing a novel and efficient solution for the deployment of vision foundation models. The code is available at https://github.com/DongshuoYin/CoLin.
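To make the ingredients named above concrete, the following is a minimal NumPy sketch of a multi-branch low-rank residual adapter, an SVD-based initialization, and an orthogonality penalty. It is an illustration under stated assumptions, not the released CoLin implementation: the function names, the residual summation over branches, the choice to factor a small random matrix, and the Frobenius-norm form of the penalty are all hypothetical choices made here for clarity.

```python
import numpy as np


def svd_init(d_in, d_out, rank, rng):
    """Assumed SVD-based init: factor a small random matrix into rank-r parts.

    Returns a down-projection with orthonormal columns and a matching
    up-projection. The paper's actual initialization may differ.
    """
    w = rng.standard_normal((d_in, d_out)) * 0.01
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    down = u[:, :rank]                    # (d_in, rank), orthonormal columns
    up = s[:rank, None] * vt[:rank]       # (rank, d_out), scaled by singular values
    return down, up


def orthogonal_penalty(down):
    """Assumed orthogonality loss: squared Frobenius norm of (D^T D - I)."""
    gram = down.T @ down
    return float(np.sum((gram - np.eye(gram.shape[0])) ** 2))


def adapter_forward(x, branches):
    """Residual adapter: input plus the sum of each branch's low-rank update."""
    return x + sum((x @ down) @ up for down, up in branches)


def adapter_param_count(d_model, rank, branches):
    """Parameters of one adapter: each branch holds a down and an up matrix."""
    return branches * 2 * d_model * rank


rng = np.random.default_rng(0)
branches = [svd_init(16, 16, 4, rng) for _ in range(2)]
x = rng.standard_normal((3, 16))
y = adapter_forward(x, branches)
penalty = sum(orthogonal_penalty(down) for down, _ in branches)
```

In this sketch the penalty is near zero right after initialization, because the down-projections start with orthonormal columns; during training it would be added to the task loss to discourage the low-rank directions from collapsing onto each other.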
