LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Jia-Chen Zhang, Yu-Jie Xiong, He-Xi Qiu, Dong-Hai Zhu, Chun-Ming Xia

TL;DR

LoRA$^2$ addresses parameter-efficient fine-tuning of large language models by extending LoRA to multiple scales: it trains a set of low-rank updates in two mutually orthogonal planes, which enlarges the learnable parameter space while keeping the number of trainable parameters small. It combines an SVD-based orthogonal approximation with dual regularizers and AdaLoRA-inspired dynamic rank pruning, and it reduces the parameter sensitivity score calculations needed for the importance score by approximately 98.5%. On GLUE, with DeBERTaV3-base and RoBERTa-large, LoRA$^2$ achieves comparable or superior performance to full fine-tuning and strong PEFT baselines while training as little as 0.72% of the parameters. This makes fine-tuning large models feasible with far fewer training resources and motivates further work on multi-scale, orthogonal parameter-efficient methods.

Abstract

Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks. In this paper, we extend LoRA to multiple scales, dubbed LoRA$^2$. We first combine orthogonal projection theory to train a set of LoRAs in two mutually orthogonal planes. Then, we improve the importance score algorithm, which reduces parameter sensitivity score calculations by approximately 98.5%. Singular values with lower importance scores are pruned, which enhances adaptability to various downstream tasks. Extensive experiments are conducted on two widely used pre-trained models to validate the effectiveness of LoRA$^2$. Results show that it reduces the number of trainable parameters to just 0.72% of those used in full fine-tuning, while still delivering highly impressive performance. Even when the parameters are further reduced to 0.17M, it still achieves results comparable to the baseline with 8 times more parameters. Our code is available here: https://anonymous.4open.science/r/LoRA-2-5B4C
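
To make the parameter-efficiency argument concrete, the following minimal sketch counts trainable parameters for a single weight matrix under full fine-tuning and under a plain LoRA update of rank r. It illustrates vanilla LoRA only, not the LoRA$^2$ construction, and the function names are hypothetical. The 768 x 768 shape and r = 8 are illustrative assumptions. The result is not meant to reproduce the 0.72% figure, which depends on which matrices are adapted and on the total model size.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    Plain LoRA instead trains two factors, B (d_out x rank) and A (rank x d_in),
    whose product B @ A is the low-rank increment added to the frozen weight.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of the matrix's parameters that plain LoRA trains."""
    full, lora = lora_param_counts(d_out, d_in, rank)
    return lora / full


if __name__ == "__main__":
    # Illustrative shape only (an assumption, not a setting taken from the paper).
    full, lora = lora_param_counts(768, 768, 8)
    print(f"full={full} lora={lora} fraction={100 * lora / full:.2f}%")
```

For this shape the low-rank factors hold 12,288 parameters against 589,824 for the full matrix, about 2.08% of that one matrix.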

Paper Structure (15 sections, 11 equations, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: Blue blocks represent frozen parameters, while orange represents trainable parameters. (A) LoRA only utilizes a set of low-rank matrices to approximate increments. (B) LoRA$^2$ trains a set of low-rank matrices in two mutually orthogonal planes.
  • Figure 2: An illustration of Multi-Scale Orthogonal Low-Rank Approximations (LoRA$^2$): based on the principle of orthogonal projection, we train a set of inherently orthogonal LoRAs on orthogonal planes as the incremental matrices.
  • Figure 3: The final rankings after training with LoRA$^2$ $(r=8)$ on four datasets (i.e., MNLI, QQP, MRPC, and SST-2). The X-axis is the index of DeBERTaV3-base layers, and the Y-axis indicates the different layers to which LoRA$^2$ is applied. The lighter the color, the lower the degree of pruning.
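
The pruning that Figure 3 visualizes can be sketched roughly as follows, assuming an AdaLoRA-style global budget: every singular value carries an importance score, only the highest-scoring ones across all adapted matrices are kept, and the number kept per matrix is its remaining rank. The scores, matrix names, and budget below are hypothetical placeholders. The paper's actual importance score is not defined in this section, so arbitrary numbers stand in for it.

```python
from typing import Dict, List


def prune_by_score(scores: Dict[str, List[float]], budget: int) -> Dict[str, List[bool]]:
    """Keep the `budget` highest-scoring singular values across all matrices.

    `scores` maps a matrix name to one importance score per singular value.
    Returns a boolean keep-mask with the same layout.
    """
    # Flatten to (score, matrix name, index) so the ranking is global.
    flat = [(s, name, i) for name, vals in scores.items() for i, s in enumerate(vals)]
    flat.sort(key=lambda t: t[0], reverse=True)

    keep = {name: [False] * len(vals) for name, vals in scores.items()}
    for _, name, i in flat[:budget]:
        keep[name][i] = True
    return keep


def final_ranks(keep: Dict[str, List[bool]]) -> Dict[str, int]:
    """Number of surviving singular values (remaining rank) per matrix."""
    return {name: sum(mask) for name, mask in keep.items()}


if __name__ == "__main__":
    # Hypothetical scores for two matrices with four singular values each.
    example_scores = {
        "layer0.query": [0.9, 0.1, 0.4, 0.05],
        "layer0.value": [0.8, 0.7, 0.6, 0.2],
    }
    masks = prune_by_score(example_scores, budget=4)
    print(final_ranks(masks))
```

With these placeholder scores and a budget of 4, the query matrix keeps one singular value and the value matrix keeps three, so matrices with uniformly higher scores are pruned less.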