Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

Bojia Zi, Xianbiao Qi, Lingzhi Wang, Jianan Wang, Kam-Fai Wong, Lei Zhang

TL;DR

Delta-LoRA tackles the limited learning capacity of traditional low-rank adapters by updating the full pre-trained weights through the delta of the product of two low-rank matrices, while preserving LoRA-like memory costs. It removes dropout in the low-rank path to align gradients and uses the delta of the product $\boldsymbol{A}\boldsymbol{B}$ as a surrogate for a direct update of $\boldsymbol{W}$, enabling richer representations without extra memory overhead. Across NLP tasks, Delta-LoRA achieves state-of-the-art results among PEFT methods on NLG and NLU benchmarks, supported by thorough ablations and gradient analyses. The approach offers a practical, memory-efficient way to boost fine-tuning performance for large language models.

Abstract

In this paper, we present Delta-LoRA, a novel parameter-efficient approach to fine-tuning large language models (LLMs). In contrast to LoRA and other low-rank adaptation methods such as AdaLoRA, Delta-LoRA not only updates the low-rank matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, but also propagates the learning to the pre-trained weights $\boldsymbol{W}$ via updates that use the delta of the product of the two low-rank matrices ($\boldsymbol{A}^{(t+1)}\boldsymbol{B}^{(t+1)} - \boldsymbol{A}^{(t)}\boldsymbol{B}^{(t)}$). This strategy addresses the limitation that incrementally updating the low-rank matrices alone is inadequate for learning representations suited to downstream tasks. Moreover, because updating $\boldsymbol{W}$ requires neither computing the gradients of $\boldsymbol{W}$ nor storing their momenta, Delta-LoRA has memory requirements and computational costs comparable to those of LoRA. Extensive experiments show that Delta-LoRA significantly outperforms existing low-rank adaptation methods. We further support these results with comprehensive analyses that underscore the effectiveness of Delta-LoRA.
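
The update described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `delta_lora_step`, the plain-SGD step on $\boldsymbol{A}$ and $\boldsymbol{B}$, the learning rate `lr`, and the concrete value of the update ratio `lam` are all assumptions made for illustration, and any LoRA scaling factor is omitted. The point it shows is that $\boldsymbol{W}$ is moved by the change in the product $\boldsymbol{A}\boldsymbol{B}$, so no gradient or optimizer state for $\boldsymbol{W}$ is ever needed.

```python
import numpy as np


def delta_lora_step(W, A, B, grad_A, grad_B, lr=1e-2, lam=0.5):
    """One illustrative Delta-LoRA-style step.

    Assumptions (not from the source): plain SGD on A and B, and the
    hypothetical defaults lr=1e-2 and lam=0.5.
    """
    AB_old = A @ B                      # product before the step
    A_new = A - lr * grad_A             # update the low-rank factors
    B_new = B - lr * grad_B
    delta_AB = A_new @ B_new - AB_old   # A^(t+1) B^(t+1) - A^(t) B^(t)
    W_new = W + lam * delta_AB          # propagate the delta to W; no grad of W used
    return W_new, A_new, B_new


# Toy shapes: W is c x d, A is c x r, B is r x d.
rng = np.random.default_rng(0)
c, d, r = 6, 4, 2
W = rng.normal(size=(c, d))
A = rng.normal(size=(c, r))
B = rng.normal(size=(r, d))
grad_A = rng.normal(size=(c, r))   # stand-ins for real gradients
grad_B = rng.normal(size=(r, d))

W_new, A_new, B_new = delta_lora_step(W, A, B, grad_A, grad_B)
print("norm of the change applied to W:", np.linalg.norm(W_new - W))
```

Only `grad_A` and `grad_B` enter the step, which is why the memory footprint stays close to that of LoRA in this sketch.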

Paper Structure

This paper contains 16 sections, 8 equations, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: An overview of the proposed Delta-LoRA structure, compared to LoRA, DyLoRA and AdaLoRA. Note that DyLoRA and LoRA share essentially the same architecture. $\boldsymbol{W}$ is the pre-trained weight, which is frozen (shown in blue) during parameter-efficient fine-tuning in (a) and (b). The orange trapezoids $\boldsymbol{A}$, $\boldsymbol{B}$ and $\boldsymbol{E}$ denote the trainable parameters. In our proposed Delta-LoRA, the light orange rectangle indicates that the pre-trained weights can be updated via the delta. Note that Delta-LoRA removes the Dropout layer to ensure a reasonable delta for the pre-trained matrix.
  • Figure 2: The framework of our proposed Delta-LoRA. The blue arrows represent the forward pass, while the yellow dashed arrows denote backward propagation. The black solid arrows in (b) represent the process of updating the low-rank adaptation matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ with the normalized gradients $\widehat{\boldsymbol{g}}_{\boldsymbol{A}}$ and $\widehat{\boldsymbol{g}}_{\boldsymbol{B}}$ multiplied by the learning rate $\eta$, as well as updating the pre-trained weights $\boldsymbol{W}$ with the delta matrix $\triangle \boldsymbol{A}\boldsymbol{B}$ multiplied by the update ratio $\lambda$.
  • Figure 3: A comparison of Fine-Tuning$\ddag$, LoRA and Delta-LoRA in terms of the cosine similarity between the fine-tuned parameters and the original pre-trained parameters in each transformer block. A higher value means higher similarity.
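
The similarity measure named in the Figure 3 caption can be sketched as follows. This is an illustrative guess at the computation, not the authors' code: the helper name `cosine_similarity` and the choice to flatten each weight matrix into a single vector before comparing are assumptions.

```python
import numpy as np


def cosine_similarity(w_finetuned, w_pretrained):
    """Cosine similarity between two weight tensors, flattened to vectors.

    Flattening is an assumption; the source does not say how the
    per-block similarity is aggregated.
    """
    a = np.ravel(w_finetuned)
    b = np.ravel(w_pretrained)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy example: a small perturbation keeps the similarity close to 1.
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(8, 8))
w_ft = w_pre + 0.01 * rng.normal(size=(8, 8))
sim = cosine_similarity(w_ft, w_pre)
print("cosine similarity:", sim)
```

Under this reading, a method that changes the pre-trained weights more strongly yields a lower value for the affected block.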