Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

Bojia Zi; Xianbiao Qi; Lingzhi Wang; Jianan Wang; Kam-Fai Wong; Lei Zhang

TL;DR

Delta-LoRA tackles the limited learning capacity of conventional low-rank adapters by also updating the full pre-trained weights $\boldsymbol{W}$, using the step-to-step delta of the product of the two low-rank matrices, while keeping memory costs comparable to LoRA. It removes dropout from the low-rank path so that this delta is a reasonable surrogate for a direct update of $\boldsymbol{W}$, which means no gradients or optimizer state need to be stored for $\boldsymbol{W}$ itself. Across NLG and NLU benchmarks, Delta-LoRA is reported to outperform existing low-rank adaptation methods, and the results are supported by ablations and gradient analyses. The approach offers a practical, memory-efficient way to improve fine-tuning performance for large language models.

Abstract

In this paper, we present Delta-LoRA, a novel parameter-efficient approach to fine-tuning large language models (LLMs). In contrast to LoRA and other low-rank adaptation methods such as AdaLoRA, Delta-LoRA not only updates the low-rank matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, but also propagates the learning to the pre-trained weights $\boldsymbol{W}$ via updates that use the delta of the product of the two low-rank matrices ($\boldsymbol{A}^{(t+1)}\boldsymbol{B}^{(t+1)} - \boldsymbol{A}^{(t)}\boldsymbol{B}^{(t)}$). This strategy addresses the limitation that incremental updates of the low-rank matrices alone are inadequate for learning representations suited to downstream tasks. Moreover, because updating $\boldsymbol{W}$ requires neither computing the gradients of $\boldsymbol{W}$ nor storing their momenta, Delta-LoRA has memory requirements and computational costs comparable to those of LoRA. Extensive experiments show that Delta-LoRA significantly outperforms existing low-rank adaptation methods. We further support these results with comprehensive analyses that underscore the effectiveness of Delta-LoRA.
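
The update described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `delta_lora_step`, the toy least-squares loss, the matrix sizes, and the values of `lr` and `lam` are all made up for the example; plain SGD stands in for whatever optimizer is actually used; and any extra scaling factor or warm-up schedule is left out.

```python
import numpy as np


def delta_lora_step(W, A, B, grad_A, grad_B, lr, lam):
    """Apply one plain-SGD Delta-LoRA-style update (illustrative sketch).

    W is the (c, d) pre-trained weight; A is (c, r) and B is (r, d).
    No gradient or optimizer state is kept for W itself.
    """
    A_new = A - lr * grad_A
    B_new = B - lr * grad_B
    # Delta of the low-rank product between two consecutive steps.
    delta_AB = A_new @ B_new - A @ B
    # Propagate that delta to the pre-trained weight, scaled by lam.
    W_new = W + lam * delta_AB
    return W_new, A_new, B_new


# Toy setup: every size and value here is hypothetical.
rng = np.random.default_rng(0)
c, d, r, n = 6, 4, 2, 8
W = rng.normal(size=(c, d))
A = rng.normal(size=(c, r)) * 0.1
B = np.zeros((r, d))  # LoRA-style zero init, so A @ B starts at zero
x = rng.normal(size=(n, c))
target = rng.normal(size=(n, d))

W0 = W.copy()
for _ in range(3):
    # Forward pass without dropout: h = x (W + A B).
    h = x @ (W + A @ B)
    # Gradient of the mean squared error w.r.t. the effective weight W + A B.
    G = x.T @ (h - target) / n
    # Chain rule through the product A @ B; W gets no gradient of its own.
    grad_A = G @ B.T
    grad_B = A.T @ G
    W, A, B = delta_lora_step(W, A, B, grad_A, grad_B, lr=0.05, lam=0.5)
```

Only `A` and `B` receive gradients here; `W` changes solely through the delta of their product, which is why no gradient buffer or momentum for `W` has to be stored.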


Paper Structure (16 sections, 8 equations, 3 figures, 13 tables)

Introduction
Preliminaries
Related Works
Methodology
Update the Delta of Low-rank Matrices on Pre-trained Weights
The structure of our Delta-LoRA
Experiments
Baselines
Natural Language Generation
Natural Language Understanding
Comprehensive Understanding of Delta-LoRA
Conclusion
Appendix
The Expansion of $\triangle \boldsymbol{A}\boldsymbol{B}$
The Parameter Sensitivity Study
...and 1 more sections

Figures (3)

Figure 1: An overview of the proposed Delta-LoRA structure, compared to LoRA, DyLoRA and AdaLoRA. Note that DyLoRA and LoRA essentially share the same architecture. $\boldsymbol{W}$ is the pre-trained weight, which is frozen (signified by blue) when performing parameter-efficient fine-tuning in (a) and (b). Orange trapezoids $\boldsymbol{A}$, $\boldsymbol{B}$ and $\boldsymbol{E}$ denote the trainable parameters. In our proposed Delta-LoRA, the light orange rectangle means that the pre-trained weights can be updated via the delta. Note that Delta-LoRA removes the Dropout layer to ensure a reasonable delta for the pre-trained matrix.
Figure 2: The framework of our proposed Delta-LoRA. The blue arrows represent the forward pass, while the yellow dashed arrows denote backward propagation. The black solid arrows in (b) represent the process of updating the low-rank adaptation matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ with the normalized gradients $\widehat{\boldsymbol{g}}_{\boldsymbol{A}}$ and $\widehat{\boldsymbol{g}}_{\boldsymbol{B}}$ multiplied by the learning rate $\eta$, as well as updating the pre-trained weights $\boldsymbol{W}$ with the delta matrix $\triangle \boldsymbol{A}\boldsymbol{B}$ multiplied by the update ratio $\lambda$.
Figure 3: A comparison of Fine-Tuning$\ddag$, LoRA and Delta-LoRA in terms of the cosine similarity between the fine-tuned parameters and the original pre-trained parameters in each transformer block. A higher value means higher similarity.
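
The update that the Figure 2 caption describes can be written out explicitly. The block below is a sketch assembled from the caption's symbols and the abstract's step index $t$; it assumes a plain step of size $\eta$ along the normalized gradients and leaves out any additional scaling factor or start-step schedule the full method may apply.

```latex
\begin{aligned}
\boldsymbol{A}^{(t+1)} &= \boldsymbol{A}^{(t)} - \eta\,\widehat{\boldsymbol{g}}_{\boldsymbol{A}},
\qquad
\boldsymbol{B}^{(t+1)} = \boldsymbol{B}^{(t)} - \eta\,\widehat{\boldsymbol{g}}_{\boldsymbol{B}}, \\
\triangle \boldsymbol{A}\boldsymbol{B}
  &= \boldsymbol{A}^{(t+1)}\boldsymbol{B}^{(t+1)} - \boldsymbol{A}^{(t)}\boldsymbol{B}^{(t)} \\
  &= -\eta\,\boldsymbol{A}^{(t)}\widehat{\boldsymbol{g}}_{\boldsymbol{B}}
     - \eta\,\widehat{\boldsymbol{g}}_{\boldsymbol{A}}\boldsymbol{B}^{(t)}
     + \eta^{2}\,\widehat{\boldsymbol{g}}_{\boldsymbol{A}}\widehat{\boldsymbol{g}}_{\boldsymbol{B}}, \\
\boldsymbol{W}^{(t+1)} &= \boldsymbol{W}^{(t)} + \lambda\,\triangle \boldsymbol{A}\boldsymbol{B}.
\end{aligned}
```

The second line of the expansion follows by multiplying out $(\boldsymbol{A}^{(t)} - \eta\,\widehat{\boldsymbol{g}}_{\boldsymbol{A}})(\boldsymbol{B}^{(t)} - \eta\,\widehat{\boldsymbol{g}}_{\boldsymbol{B}})$ and cancelling the $\boldsymbol{A}^{(t)}\boldsymbol{B}^{(t)}$ term.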
