Serial Low-rank Adaptation of Vision Transformer

Houqiang Zhong, Shaocheng Shen, Ke Cai, Zhenglong Wu, Jiangchao Yao, Yuan Cheng, Xuefei Li, Xiaoyun Zhang, Li Song, Qiang Hu

TL;DR

This work addresses the high cost of fine-tuning large vision transformers by introducing Serial LoRA, a shared low-rank adaptation that is serially composed with the attention mechanism. By learning a single shared low-rank pair $\Delta \mathbf{W}_s = \mathbf{B}_s \mathbf{A}_s$ and applying it before the pre-trained projections, the method reduces trainable parameters to about 1/4 of standard LoRA while maintaining competitive performance. Experiments on 24 datasets spanning diffusion-based generation (Stable Diffusion 3.0), CLIP-based classification, and SAM segmentation show consistent parameter efficiency and compatibility with LoRA+ enhancements. Singular value analyses indicate that the approach preserves representation power: the spectrum of the adapted weights lies between that of the pre-trained model and that of LoRA. The method applies broadly to transformer-based vision foundation models. Overall, Serial LoRA enables scalable, resource-efficient fine-tuning for diverse vision tasks while keeping accuracy comparable to LoRA in most cases.
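The 1/4 figure can be checked with simple counting. The sketch below is illustrative only and rests on assumptions not stated in this summary: that standard LoRA attaches one low-rank pair to each of four square attention projections (query, key, value, output), and that Serial LoRA uses one shared pair per attention block. The hidden size `d = 768` and the helper names are hypothetical.

```python
def low_rank_pair_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one pair (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_in + d_out)


def lora_params(d: int, rank: int, n_proj: int = 4) -> int:
    """Assumed standard LoRA: one pair per attention projection."""
    return n_proj * low_rank_pair_params(d, d, rank)


def serial_lora_params(d: int, rank: int) -> int:
    """Assumed Serial LoRA: a single shared pair per attention block."""
    return low_rank_pair_params(d, d, rank)


d = 768  # hypothetical hidden size, chosen for illustration
for rank in (8, 16, 32, 64):
    lora = lora_params(d, rank)
    serial = serial_lora_params(d, rank)
    print(f"rank={rank:2d}  LoRA={lora:7d}  Serial LoRA={serial:6d}  ratio={serial / lora:.2f}")
```

Under these assumptions the ratio is 1/4 at every rank, because both counts scale linearly with the rank.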

Abstract

Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, given the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods that further reduce parameter and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composed with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm the consistent superiority of our method.
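A minimal NumPy sketch of what "serially composed" could look like, assuming the shared update acts on the attention input as $(\mathbf{I} + \mathbf{B}_s \mathbf{A}_s)\mathbf{x}$ before the frozen query, key, and value projections. The function name, the shapes, and the zero initialization of $\mathbf{B}_s$ (borrowed from standard LoRA practice) are assumptions made for illustration, not details confirmed by this summary.

```python
import numpy as np


def serial_lora_qkv(x, w_q, w_k, w_v, a_s, b_s):
    """Project tokens through a shared low-rank update, then frozen weights.

    x: (n, d) tokens as rows; w_q, w_k, w_v: (d, d) frozen projections;
    a_s: (r, d) and b_s: (d, r) form the shared trainable pair.
    """
    # Row-wise form of (I + B_s A_s) x, computed without building a d x d matrix.
    x_adapted = x + (x @ a_s.T) @ b_s.T
    # The pre-trained projections stay untouched and see the adapted input.
    return x_adapted @ w_q.T, x_adapted @ w_k.T, x_adapted @ w_v.T


rng = np.random.default_rng(0)
n, d, r = 4, 16, 2  # hypothetical sizes
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
a_s = rng.standard_normal((r, d))

# Assumed LoRA-style init: B_s = 0, so the adapted model starts at the pre-trained one.
b_s = np.zeros((d, r))
q0, k0, v0 = serial_lora_qkv(x, w_q, w_k, w_v, a_s, b_s)
starts_at_pretrained = np.allclose(q0, x @ w_q.T)

# With a nonzero B_s, the result matches an effective weight W (I + B_s A_s).
b_s = 0.01 * rng.standard_normal((d, r))
q1, k1, v1 = serial_lora_qkv(x, w_q, w_k, w_v, a_s, b_s)
w_q_eff = w_q @ (np.eye(d) + b_s @ a_s)
matches_effective_weight = np.allclose(q1, x @ w_q_eff.T)
```

In this form the same pair modifies all three projections at once, which is one way to read the claim that the design captures what the separate LoRA updates have in common.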

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Left: Serial LoRA structure, illustrating the application of Serial LoRA in Transformers. Right: Comparison results of Serial LoRA and LoRA across various computer vision tasks, including image generation, image classification and semantic segmentation.
  • Figure 2: Instead of learning separate pairs of matrices, Serial LoRA learns a shared pair of low-rank matrices, significantly reducing the number of trainable parameters. Its strong scalability allows it to be applied directly to various vision foundation models, such as CLIP, Stable Diffusion 3.0 and SAM, enhancing efficiency across diverse applications.
  • Figure 3: Singular value analysis across different transformer blocks of the SAM model. The comparison between SAM weights (red), Serial LoRA (green), LoRA (blue), LoRA+ (purple), and Serial LoRA+ (orange) shows that the Serial LoRA variants maintain intermediate singular value distributions with gradual decay patterns across network depths.
  • Figure 4: Qualitative comparison of Serial LoRA and LoRA.
  • Figure 5: Comparison of LoRA and Serial LoRA performance in the CLIP model on image classification tasks from rank 8 to 64, measured in accuracy (%). The datasets used include Food101, DTD, Caltech101, and Oxford Pets.