1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

Dongshuo Yin, Xue Yang, Deng-Ping Fan, Shi-Min Hu

TL;DR

CoLin introduces a parameter-efficient visual adapter that uses a multi-branch low-rank complex projection to adapt vision foundation models with about 1% new parameters. The authors provide a matrix-theoretic analysis showing gradient-direction entanglement in low-rank synthesis and propose an orthogonal loss with SVD-based initialization to mitigate it, boosting convergence. Extensive experiments across object detection, segmentation, classification, and rotated object detection demonstrate that CoLin consistently outperforms full fine-tuning and delta-tuning baselines. The approach offers practical transfer efficiency for large-scale visual models, enabling deployment in diverse domains such as remote sensing and medical imaging.

Abstract

Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that adds only about 1% new parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (a remote sensing scenario) demonstrate that CoLin is the first method to outperform both full fine-tuning and classical delta-tuning approaches with only about 1% new parameters, providing a novel and efficient solution for the deployment of vision foundation models. The code is available at https://github.com/DongshuoYin/CoLin.

Paper Structure (32 sections, 38 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Performance radar. The proposed method is the first to outperform full fine-tuning and typical delta-tuning approaches (based on Swin-B/L) on 8 visual tasks while fine-tuning only about 1% new parameters. This parameter-efficiency advantage can be further enhanced on larger foundation models. The maximum on each axis represents the best performance on the corresponding dataset.
  • Figure 2: Module schematic. The down-projection matrix $W^D$ and the up-projection matrix $W^U$ are each the sum of $\alpha$ branches, $W_1^D, \dots, W_{\alpha}^D$ and $W_1^U, \dots, W_{\alpha}^U$ respectively. In the $i$-th branch, $K_i$ is shared between $W_i^D$ and $W_i^U$. All $P$ and $Q$ are shared among branches. All $K_i$ are trainable, and all the $W$ matrices are calculated rather than stored as separate parameters. A single depth-wise (DW) convolution layer is added before GELU.
  • Figure 3: Convergence simulation. Theoretical analysis and simulation experiments demonstrate that in low-rank synthesis scenarios, orthogonality (blue) significantly enhances the overall convergence efficiency of the model.
  • Figure 4: Impact of orthogonal loss on matrices of different sizes. All plots cover iterations 0 to 2000. The three plots differ only in the size of the matrices' second dimension. As this size increases, the benefit of adding the orthogonal loss becomes greater.
  • Figure 5: Insertion location. CoLin is inserted after the two skip connections in each SwinBlock.
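
The captions for Figures 3 and 4 refer to an orthogonal loss, but this summary does not give its form. As a rough illustration of what such a term typically measures, the sketch below computes a common soft-orthogonality penalty, $\lVert W^\top W - I \rVert_F^2$, which is zero exactly when the columns of $W$ are orthonormal. This generic form is an assumption and may differ from the loss defined in the paper. The function names `gram` and `orthogonality_penalty` and the two example matrices are hypothetical.

```python
def gram(W):
    """Return W^T W for a matrix stored as a list of rows."""
    cols = list(zip(*W))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def orthogonality_penalty(W):
    """Squared Frobenius norm of (W^T W - I).

    Assumed generic form, not necessarily the paper's loss.
    The value is zero if and only if the columns of W are orthonormal.
    """
    G = gram(W)
    n = len(G)
    return sum(
        (G[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )


# Hypothetical 3x2 example with orthonormal columns: the penalty vanishes.
W_orth = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Columns are unit length but correlated: the penalty is positive.
W_skew = [[1.0, 0.6], [0.0, 0.8], [0.0, 0.0]]

print(orthogonality_penalty(W_orth))
print(orthogonality_penalty(W_skew))
```

In a training loop, a term like this would usually be added to the task loss with a small weight, so that gradient steps also push the low-rank factors toward orthogonality.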