Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning

Ningyuan Tang, Minghao Fu, Ke Zhu, Jianxin Wu

TL;DR

This paper tackles the GPU memory and speed bottlenecks of parameter-efficient fine-tuning (PEFT) by proposing LAST, a side-tuning framework that completely freezes the pretrained backbone and its outputs while learning a side-network built from Low-rank Self-Attention (LSA) blocks. The LSA module projects tokens to a very low dimension $r$ ($r \ll d$), performs self-attention in this reduced space, and projects back, eliminating reliance on large FFNs and avoiding backpropagation through the backbone. A bias correction is introduced to ensure the pretrained representation remains properly isolated, yielding a representation $u_m - \sum_{i=0}^{m-1} z_i$ that separates the task-specific component from the frozen backbone. LAST enables highly parallelizable training across multiple hyperparameter settings and achieves state-of-the-art PEFT performance on VTAB-1K and FGVC datasets with substantially reduced memory footprint (e.g., around $1.33$ GB on ViT-B/16) and faster training times, while remaining scalable to large backbones such as ViT-g. The approach promises practical impact by enabling efficient fine-tuning of very large models on modest hardware and suggests extensions to other backbones and modalities, including large language models.
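The low-rank attention step described above (project to dimension $r$, attend, project back to $d$) can be illustrated with a minimal NumPy sketch. This is one plausible reading of the summary, not the authors' implementation: the single attention head, the absence of biases and normalization, the weight names (`w_q`, `w_k`, `w_v`, `w_up`), the zero-initialized up-projection, and the value `r = 16` are all assumptions made for illustration. The token count 197 corresponds to a ViT-B/16 input of 224×224 pixels (196 patches plus one class token).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_self_attention(tokens, w_q, w_k, w_v, w_up):
    """Hypothetical single-head low-rank self-attention.

    tokens: (n, d) frozen backbone features
    w_q, w_k, w_v: (d, r) down-projections with r << d
    w_up: (r, d) projection back to the backbone width
    """
    q = tokens @ w_q                               # (n, r)
    k = tokens @ w_k                               # (n, r)
    v = tokens @ w_v                               # (n, r)
    r = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(r), axis=-1)  # (n, n) attention weights
    return (attn @ v) @ w_up                       # (n, d)

rng = np.random.default_rng(0)
n, d, r = 197, 768, 16  # r = 16 is an assumed value, not taken from the paper
tokens = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, r)) / np.sqrt(d) for _ in range(3))
w_up = np.zeros((r, d))  # assumed zero init: the block starts as a no-op

out = low_rank_self_attention(tokens, w_q, w_k, w_v, w_up)
print(out.shape)  # (197, 768)
```

In this sketch the attention matrix is still $n \times n$, but every learned projection is only $d \times r$ or $r \times d$, which is where the parameter savings come from.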

Abstract

When finetuning a large pretrained model on downstream tasks, parameter-efficient fine-tuning (PEFT) methods can adapt the model effectively with few trainable parameters, but they suffer from high GPU memory consumption and slow training speed. Because the learnable parameters of these methods are entangled with the pretrained model, gradients related to the frozen pretrained model's parameters have to be computed and stored during finetuning. We propose Low-rank Attention Side-Tuning (LAST), which disentangles the trainable module from the pretrained model by freezing not only the parameters but also the outputs of the pretrained network. LAST trains a side-network composed of only low-rank self-attention modules. By viewing the pretrained model as a frozen feature extractor, the side-network takes intermediate outputs from the pretrained model and focuses on learning task-specific knowledge. We also show that LAST can be highly parallel across multiple optimization objectives, making it very efficient in downstream task adaptation, for example, in finding optimal hyperparameters. LAST outperforms previous state-of-the-art methods on VTAB-1K and other visual adaptation tasks, using only about 30% of the GPU memory footprint and 60% of the training time of existing PEFT methods while achieving significantly higher accuracy.
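To give a sense of scale for a side-network "composed of only low-rank self-attention modules", the following back-of-the-envelope sketch compares weight counts. It rests on stated assumptions rather than the paper's exact design: a standard Transformer block is counted as four $d \times d$ attention matrices plus an FFN with expansion ratio 4 (biases and layer norms ignored), and the low-rank block is assumed to hold three $d \times r$ down-projections and one $r \times d$ up-projection with a hypothetical $r = 16$. The width $d = 768$ is that of ViT-B.

```python
def transformer_block_params(d, ffn_ratio=4):
    """Weights in a standard Transformer block, ignoring biases and norms."""
    attention = 4 * d * d            # W_q, W_k, W_v, W_o
    ffn = 2 * ffn_ratio * d * d      # two linear layers, d -> 4d -> d
    return attention + ffn

def low_rank_attention_params(d, r):
    """Weights in the assumed low-rank block: 3 down-projections + 1 up-projection."""
    return 4 * d * r

d, r = 768, 16  # d from ViT-B; r = 16 is an assumption for illustration
full = transformer_block_params(d)
side = low_rank_attention_params(d, r)

print(full)         # 7077888
print(side)         # 49152
print(full // side) # 144
```

Parameter count is only part of the picture: as the abstract notes, the memory savings come mainly from not having to compute and store gradients through the frozen backbone.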

Paper Structure (22 sections, 7 equations, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: GPU memory footprint, accuracy, and training speed of different PEFT methods on the VTAB-1K benchmark (Zhai et al., 2019). GPU memory and training throughput are measured with batch size 32 on one NVIDIA TITAN Xp GPU. The pretrained model is ViT-B. The proposed LAST method outperforms other methods with significantly lower GPU memory usage and higher training speed. With these advantages, LAST can even finetune the huge ViT-g model (Oquab et al., 2023) on a single NVIDIA TITAN Xp GPU (which has only 12 GB of memory and was released in 2017).
  • Figure 2: Overall architecture of the proposed LAST method. On the left, for each group of $g$ Transformer blocks, we insert an LSA block in the side-network; its details are shown on the right. Note that there is no arrow from the LSA modules (bottom branch) back to the pretrained network (top branch), hence the pretrained network can be viewed as a standalone feature extractor.
  • Figure 3: Effect of the bias correction in the LSA module. Y-axis represents the group-wise average accuracy on VTAB-1K.
  • Figure 4: Effect of varying $T$ in LAST. Y-axis represents the group-wise average accuracy on VTAB-1K.
  • Figure 5: t-SNE visualization of the feature distributions from four different finetuning methods: linear probing, full finetuning, LoRA, and our LAST. The "dSprites/loc" dataset comes from the "Structured" group of VTAB-1K, "Resisc45" from the "Specialized" group, and "Caltech101" from the "Natural" group. Dots of different colors belong to different categories. (Best viewed in color.)