LoRA$^2$ : Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Jia-Chen Zhang; Yu-Jie Xiong; He-Xi Qiu; Dong-Hai Zhu; Chun-Ming Xia

TL;DR

LoRA$^2$ addresses parameter-efficient fine-tuning of large language models by extending LoRA to multi-scale, orthogonal low-rank updates, which enlarge the learnable parameter space while keeping the trainable parameter count small. It combines an SVD-based approximation trained in two mutually orthogonal planes with dual regularizers and AdaLoRA-inspired dynamic rank pruning. It also exploits the structure of the resulting composite matrices to cut importance-score computation by approximately 98.5%. On GLUE, with DeBERTaV3-base and RoBERTa-large, LoRA$^2$ matches or exceeds full fine-tuning and strong PEFT baselines while training as few as 0.72% of the parameters updated in full fine-tuning. The results suggest that fine-tuning can be scaled to large models with far fewer training resources, and they motivate further work on multi-scale, orthogonal parameter-efficient methods.

Abstract

Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks. In this paper, we extend LoRA to multiple scales, dubbed LoRA$^2$. We first combine orthogonal projection theory to train a set of LoRAs in two mutually orthogonal planes. Then, we improve the importance score algorithm, which reduces parameter sensitivity score calculations by approximately 98.5\%. Pruning the singular values with lower importance scores then enhances adaptability to various downstream tasks. Extensive experiments are conducted on two widely used pre-trained models to validate the effectiveness of LoRA$^2$. Results show that it reduces the number of trainable parameters to just 0.72\% of those in full fine-tuning, while still delivering highly impressive performance. Even when the parameters are further reduced to 0.17M, it still achieves comparable results to the baseline with 8 times more parameters. Our code is available here: https://anonymous.4open.science/r/LoRA-2-5B4C
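To make the vocabulary of the abstract concrete, the following is a minimal NumPy sketch of the generic building blocks it refers to: a plain LoRA increment, an AdaLoRA-style SVD-form increment with an orthogonality penalty, and pruning of singular values by importance score. It is not the LoRA$^2$ implementation. The function names, the shapes, and the assumption that importance scores arrive as a ready-made vector are all hypothetical choices made for illustration; the two-plane construction and the reduced score computation are described in the paper itself and are not reproduced here.

```python
import numpy as np


def lora_delta(B, A):
    """Plain LoRA increment: delta_W = B @ A, with rank at most r.

    B has shape (d_out, r) and A has shape (r, d_in).
    """
    return B @ A


def svd_style_delta(P, lam, Q):
    """AdaLoRA-style increment: delta_W = P @ diag(lam) @ Q.

    P: (d_out, r), lam: (r,), Q: (r, d_in).
    Scaling the columns of P by lam is equivalent to P @ diag(lam).
    """
    return (P * lam) @ Q


def orthogonality_penalty(P, Q):
    """Regularizer pushing P's columns and Q's rows towards orthonormality.

    Returns ||P^T P - I||_F^2 + ||Q Q^T - I||_F^2.
    """
    eye = np.eye(P.shape[1])
    return (np.linalg.norm(P.T @ P - eye, "fro") ** 2
            + np.linalg.norm(Q @ Q.T - eye, "fro") ** 2)


def prune_by_importance(lam, scores, keep):
    """Zero every singular value except the `keep` highest-scoring ones.

    `scores` is assumed to be supplied by some importance estimator
    (hypothetical here); this function only applies the mask.
    """
    mask = np.zeros_like(lam)
    if keep > 0:
        mask[np.argsort(scores)[-keep:]] = 1.0
    return lam * mask
```

In this form, pruning a singular value removes one rank-1 component of the increment without touching the other components, which is why the rank can be adjusted per weight matrix during training.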

Paper Structure (15 sections, 11 equations, 3 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Our Method
Multi-Scale Orthogonal Approximation
Complex Matrix Importance Pruning
Experiments
Experimental Settings
Results
Rank Analysis
Orthogonal Constraint Analysis
Applying LoRA$^2$ to Different Weights
Conclusion
Datasets
Sparse Regularization Theory
Orthogonal Projection Theory

Figures (3)

Figure 1: Blue blocks represent frozen parameters, while orange represents trainable parameters. (A) LoRA only utilizes a set of low-rank matrices to approximate increments. (B) LoRA$^2$ trains a set of low-rank matrices in two mutually orthogonal planes.
Figure 2: An illustration of Multi-Scale Orthogonal Low-Rank Approximations (LoRA$^2$): based on the principle of orthogonal projection, we train a set of inherently orthogonal LoRAs on orthogonal planes as the incremental matrices.
Figure 3: The final rankings after training with LoRA$^2$ ($r=8$) on four datasets (i.e., MNLI, QQP, MRPC, and SST2). The X-axis is the index of DeBERTaV3-base layers, and the Y-axis indicates the different layers to which LoRA$^2$ is applied. The lighter the color, the lower the degree of pruning.
