1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

Dongshuo Yin; Xue Yang; Deng-Ping Fan; Shi-Min Hu

TL;DR

CoLin introduces a parameter-efficient visual adapter that uses a multi-branch low-rank complex projection to adapt vision foundation models with about 1% new parameters. The authors provide a matrix-theoretic analysis showing gradient-direction entanglement in low-rank synthesis and propose an orthogonal loss with SVD-based initialization to mitigate it, boosting convergence. Extensive experiments across object detection, segmentation, classification, and rotated object detection demonstrate that CoLin consistently outperforms full fine-tuning and delta-tuning baselines. The approach offers practical transfer efficiency for large-scale visual models, enabling deployment in diverse domains such as remote sensing and medical imaging.

Abstract

Deploying vision foundation models typically relies on efficient adaptation strategies, because conventional full fine-tuning is prohibitively costly and inefficient. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages do not transfer directly to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that adds only about 1% new parameters relative to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and we address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (a remote sensing scenario) demonstrate that CoLin is the first method to outperform both full fine-tuning and classical delta-tuning approaches with only about 1% new parameters, providing a novel and efficient solution for deploying vision foundation models. The code is released at https://github.com/DongshuoYin/CoLin.
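
The abstract mentions a tailored loss, and the TL;DR describes it as an orthogonal loss paired with SVD-based initialization, but this page does not give the exact formulas. The block below is therefore only a generic sketch of the two ideas, not the authors' implementation: it assumes a standard orthogonality penalty of the form $\|A^\top A - I\|_F^2$ and an initialization taken from the SVD of a random Gaussian matrix. The function names `svd_init` and `orthogonal_penalty` are hypothetical.

```python
import numpy as np


def svd_init(d_in, d_out, rank, seed=0):
    """Build low-rank factors from the SVD of a random matrix (assumed scheme).

    Returns a with orthonormal columns, shape (d_in, rank),
    and b with orthonormal rows, shape (rank, d_out).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((d_in, d_out))
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank], vt[:rank, :]


def orthogonal_penalty(a):
    """Squared Frobenius distance between a^T a and the identity (assumed form)."""
    gram = a.T @ a
    return float(np.sum((gram - np.eye(a.shape[1])) ** 2))


# Factors from the SVD start with (numerically) zero penalty,
# while an unconstrained Gaussian factor of the same shape does not.
a, b = svd_init(d_in=64, d_out=32, rank=4)
random_factor = np.random.default_rng(1).standard_normal((64, 4))
penalty_svd = orthogonal_penalty(a)
penalty_random = orthogonal_penalty(random_factor)
```

In a training loop, a penalty like this would typically be added to the task loss with a small weight; the weight and the set of matrices it applies to are choices this page does not specify.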

Paper Structure (32 sections, 38 equations, 5 figures, 8 tables)

This paper contains 32 sections, 38 equations, 5 figures, 8 tables.

Introduction
Related Work
Delta-tuning
Adapter Optimization
Methods
Multi-branch Low-rank Projection
Standard Adapter Linear Projection
Low-rank Linear Projection
Multi-branch Linear Projection
Complex Sharing Strategy
Kernel Sharing
Branch Sharing
Orthogonal Optimization of the Parameter Space
SVD-based Initialization
Parameter Analysis
...and 17 more sections

Figures (5)

Figure 1: Performance radar. For the first time, the proposed method outperforms full fine-tuning and typical delta-tuning approaches (based on Swin-B/L) on 8 visual tasks while fine-tuning only about 1% new parameters. This parameter efficiency advantage can be further enhanced on larger foundation models. The maximum on each axis represents the best performance on the corresponding dataset.
Figure 2: Module schematic. The down-projection matrix $W^D$ and the up-projection matrix $W^U$ are each the sum of $\alpha$ branch matrices, $W_1^D, \dots, W_{\alpha}^D$ and $W_1^U, \dots, W_{\alpha}^U$ respectively. The kernel $K_i$ of the $i$-th branch is shared between $W_i^D$ and $W_i^U$. All $P$ and $Q$ are shared among branches. All $K_i$ are trainable, and all the $W$ matrices are calculated. A single depth-wise (DW) convolution layer is added before GELU.
Figure 3: Convergence simulation. Theoretical analysis and simulation experiments demonstrate that in low-rank synthesis scenarios, orthogonality (blue) significantly enhances the overall convergence efficiency of the model.
Figure 4: Impact of Orthogonal Loss on Matrices of Different Sizes. Each panel covers iterations 0 to 2000. The three panels differ only in the size of the second dimension of the matrices. As the size increases, the benefit of adding orthogonal loss becomes greater.
Figure 5: Insertion location. CoLin is inserted after the two skip connections in each SwinBlock.
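
The Figure 2 caption describes how the projection matrices are assembled, and a small shape-level sketch may make that bookkeeping easier to follow. The exact factorization is not stated on this page, so the code assumes a hypothetical form $W_i^D = P^D K_i Q^D$ and $W_i^U = P^U K_i Q^U$, with $P$ and $Q$ shared across branches and $K_i$ reused for both projections. The sizes `d`, `m`, `r`, `alpha`, the helper names, and the omission of the depth-wise convolution are all assumptions made for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def build_projections(p_d, q_d, p_u, q_u, kernels):
    """Sum the branch matrices; each kernel K_i is reused for both projections."""
    w_down = sum(p_d @ k @ q_d for k in kernels)  # shape (d, m)
    w_up = sum(p_u @ k @ q_u for k in kernels)    # shape (m, d)
    return w_down, w_up


def adapter_forward(x, w_down, w_up):
    # Residual bottleneck; the depth-wise convolution from Figure 2 is omitted here.
    return x + gelu(x @ w_down) @ w_up


rng = np.random.default_rng(0)
d, m, r, alpha = 128, 32, 4, 2  # hypothetical feature dim, bottleneck dim, rank, branches
p_d = rng.standard_normal((d, r)) * 0.02
q_d = rng.standard_normal((r, m)) * 0.02
p_u = rng.standard_normal((m, r)) * 0.02
q_u = rng.standard_normal((r, d)) * 0.02
kernels = [rng.standard_normal((r, r)) for _ in range(alpha)]

w_down, w_up = build_projections(p_d, q_d, p_u, q_u, kernels)
y = adapter_forward(rng.standard_normal((10, d)), w_down, w_up)

# Entries stored by the factors versus two dense projection matrices.
dense_params = 2 * d * m
factor_params = p_d.size + q_d.size + p_u.size + q_u.size + sum(k.size for k in kernels)
```

In this simplified real-valued form the branch sum equals $P(\sum_i K_i)Q$, so the sketch only illustrates shapes and parameter counts; it does not reproduce whatever the paper's complex sharing strategy adds on top of this.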
