Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices
Bojia Zi, Xianbiao Qi, Lingzhi Wang, Jianan Wang, Kam-Fai Wong, Lei Zhang
TL;DR
Delta-LoRA tackles the limited learning capacity of traditional low-rank adapters by updating the full pre-trained weights through the delta of the product of two low-rank matrices, while preserving LoRA-like memory costs. It removes dropout in the low-rank path to align gradients and uses the change in the product AB between consecutive steps as a surrogate for direct W updates, enabling richer representations without extra memory overhead. Across NLP tasks, Delta-LoRA achieves state-of-the-art results among PEFT methods on NLG and NLU benchmarks, supported by thorough ablations and gradient analyses. The approach offers a practical, memory-efficient way to boost fine-tuning performance for large language models.
Abstract
In this paper, we present Delta-LoRA, a novel parameter-efficient approach to fine-tuning large language models (LLMs). In contrast to LoRA and other low-rank adaptation methods such as AdaLoRA, Delta-LoRA not only updates the low-rank matrices $\bA$ and $\bB$, but also propagates the learning to the pre-trained weights $\bW$ via updates that use the delta of the product of the two low-rank matrices ($\bA^{(t+1)}\bB^{(t+1)} - \bA^{(t)}\bB^{(t)}$). This strategy addresses the limitation that incremental updates of the low-rank matrices alone are inadequate for learning representations suited to downstream tasks. Moreover, because updating $\bW$ requires neither computing the gradients of $\bW$ nor storing their momenta, Delta-LoRA has memory requirements and computational costs comparable to those of LoRA. Extensive experiments show that Delta-LoRA significantly outperforms existing low-rank adaptation methods. We further support these results with comprehensive analyses that underscore the effectiveness of Delta-LoRA.
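To make the update rule concrete, the following is a minimal sketch of one training step on a single linear layer, written in plain Python so that it runs without dependencies. It is an illustration under stated assumptions, not the authors' implementation: the squared-error loss, plain SGD on $\bA$ and $\bB$, the learning rate `lr`, and the helper names (`matmul`, `transpose`, `add`, `delta_lora_step`) are all hypothetical, and any scaling of the delta or scheduling of when the $\bW$ update begins is left out. The point it shows is that $\bW$ changes only by $\bA^{(t+1)}\bB^{(t+1)} - \bA^{(t)}\bB^{(t)}$, so no gradient or optimizer state is kept for $\bW$.

```python
# Sketch of a Delta-LoRA-style step for y = x (W + A B).
# Assumptions (not from the source): squared-error loss, plain SGD on A and B,
# no scaling factor on the delta, no warm-up before W starts being updated.

def matmul(X, Y):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*X)]

def add(X, Y, scale=1.0):
    """Element-wise X + scale * Y."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def delta_lora_step(W, A, B, x, target, lr=0.1):
    """Update A and B by gradient descent, then add the change in A B to W."""
    AB_old = matmul(A, B)

    # Forward pass through the effective weight W + A B.
    y = matmul(x, add(W, AB_old))

    # Gradient of 0.5 * ||y - target||^2 with respect to the effective weight.
    err = add(y, target, scale=-1.0)
    G = matmul(transpose(x), err)

    # Chain rule through the product A B; only these small gradients are formed.
    grad_A = matmul(G, transpose(B))
    grad_B = matmul(transpose(A), G)
    A_new = add(A, grad_A, scale=-lr)
    B_new = add(B, grad_B, scale=-lr)

    # W receives the delta of the low-rank product, not its own gradient step.
    delta = add(matmul(A_new, B_new), AB_old, scale=-1.0)
    W_new = add(W, delta)
    return W_new, A_new, B_new

# Tiny example: 2x2 weight, rank-1 adapter with B initialised to zero.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A0 = [[1.0], [1.0]]
B0 = [[0.0, 0.0]]
W1, A1, B1 = delta_lora_step(W0, A0, B0, x=[[1.0, 2.0]], target=[[0.0, 0.0]])
print(W1, A1, B1)
```

In this sketch the full matrix `G` is formed explicitly for readability; a framework with automatic differentiation would obtain `grad_A` and `grad_B` directly without materialising a gradient for $\bW$, which is where the memory saving described in the abstract comes from.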
