Serial Low-rank Adaptation of Vision Transformer
Houqiang Zhong, Shaocheng Shen, Ke Cai, Zhenglong Wu, Jiangchao Yao, Yuan Cheng, Xuefei Li, Xiaoyun Zhang, Li Song, Qiang Hu
TL;DR
This work addresses the high cost of fine-tuning large vision transformers by introducing Serial LoRA, a shared low-rank adaptation that is serially composed with attention mechanisms. By learning a single shared low-rank pair $\Delta \mathbf{W}_s = \mathbf{B}_s \mathbf{A}_s$ and applying it before the pre-trained projections, the method reduces trainable parameters to about 1/4 of standard LoRA while maintaining competitive performance. Extensive experiments across diffusion generation (Stable Diffusion 3.0), CLIP-based classification, and SAM segmentation on 24 datasets demonstrate consistent parameter efficiency and compatibility with LoRA+ enhancements. The approach preserves representation power, supported by singular value analyses showing an intermediate spectrum between the pre-trained model and LoRA, and offers broad applicability to various transformer-based vision foundation models. Overall, Serial LoRA enables scalable, resource-efficient fine-tuning for diverse vision tasks without sacrificing accuracy.
Abstract
Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, given the practical constraints on computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by restricting the parameter update to a low-rank form. However, developing more advanced low-rank adaptation methods that further reduce parameter and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composed with the attention mechanism. This design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm the consistent superiority of our method.
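The shared, serially composed update described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the identity-plus-low-rank form $(\mathbf{I} + \mathbf{B}_s \mathbf{A}_s)\mathbf{x}$, the zero initialization of $\mathbf{B}_s$, the baseline of one LoRA pair per attention projection (query, key, value, output), and all function and variable names.

```python
import numpy as np

def lora_param_count(d, r, n_proj=4):
    # Assumed baseline: one (B, A) pair per attention projection (q, k, v, o).
    return n_proj * 2 * d * r

def serial_lora_param_count(d, r):
    # A single shared (B_s, A_s) pair placed before the frozen projections.
    return 2 * d * r

def serial_lora_qkv(x, W_q, W_k, W_v, A_s, B_s):
    """Apply the shared low-rank update to the input, then the frozen projections.

    x:   (n_tokens, d) token features, one token per row
    W_*: (d, d) frozen pre-trained projection weights
    A_s: (r, d) and B_s: (d, r), the only trainable matrices
    """
    # Row-vector form of (I + B_s A_s) x, assumed here for illustration.
    x_adapted = x + (x @ A_s.T) @ B_s.T
    return x_adapted @ W_q.T, x_adapted @ W_k.T, x_adapted @ W_v.T

rng = np.random.default_rng(0)
d, r, n_tokens = 64, 4, 10

x = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Assumed LoRA-style initialization: B_s = 0, so the adapted model
# initially matches the pre-trained one.
A_s = rng.normal(scale=0.01, size=(r, d))
B_s = np.zeros((d, r))

q, k, v = serial_lora_qkv(x, W_q, W_k, W_v, A_s, B_s)
ratio = serial_lora_param_count(d, r) / lora_param_count(d, r)
```

Under these assumptions, the trainable parameter count drops from four low-rank pairs to one, which is consistent with the 1/4 figure stated in the abstract.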
