LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Shaowen Wang, Linxi Yu, Jian Li

TL;DR

LoRA-GA tackles the slow convergence of LoRA with an initialization that aligns the first low-rank adapter update with the full-model gradient. By solving a gradient-alignment objective via SVD and enforcing forward/backward scale stability, the method achieves convergence rates close to those of full fine-tuning while preserving LoRA's efficiency. Experiments with T5-Base on a subset of GLUE and with Llama 2-7B on natural-language, reasoning, and coding tasks show faster convergence and performance comparable or superior to vanilla LoRA and full fine-tuning, with robustness to rank settings. The approach changes only the initialization, not the architecture or the training algorithm, and is compatible with existing LoRA variants, offering a practical route to accelerate PEFT for large models.

Abstract

Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA significantly reduces the computational and memory requirements at each iteration, extensive empirical evidence indicates that it converges considerably more slowly than full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In this paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change to the architecture or the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low-Rank Adaptation with Gradient Approximation), which aligns the gradients of the low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on a subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and Human-eval, respectively. Additionally, we observe a 2-4 times improvement in convergence speed compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance. Code is available at https://github.com/Outsider565/LoRA-GA.
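
The idea of aligning the first adapter step with the full gradient can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `gradient_svd_init`, the choice of which singular vectors go to `A` and which to `B`, the random stand-in gradient `G`, and the omission of the scale-stability factor are all assumptions made for illustration. Under these assumptions, the first-order change of the product `B @ A` after one gradient step equals the best rank-2r approximation of `G`, and subtracting `eta * B @ A` from the frozen weight keeps the initial layer output unchanged.

```python
import numpy as np

def gradient_svd_init(G, r):
    """Hypothetical helper: build rank-r adapter factors from the SVD of G.

    Assumption: B takes the top-r left singular vectors and A takes right
    singular vectors r..2r-1. The stability scaling is omitted here.
    """
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    B = U[:, :r]          # shape (d_out, r)
    A = Vt[r:2 * r, :]    # shape (r, d_in)
    return A, B

rng = np.random.default_rng(0)
d_out, d_in, r, eta = 8, 6, 2, 0.5

W0 = rng.normal(size=(d_out, d_in))   # pretrained weight (random stand-in)
G = rng.normal(size=(d_out, d_in))    # stand-in for the gradient w.r.t. W'
A, B = gradient_svd_init(G, r)

# Offset the frozen weight so that W_frozen + eta * B @ A equals W0 at step 0.
W_frozen = W0 - eta * B @ A

# Adapter gradients up to the common factor eta, which does not affect
# the direction of the update.
grad_A = B.T @ G
grad_B = G @ A.T

# First-order change of B @ A per unit learning rate (sign dropped).
first_order_update = B @ grad_A + grad_B @ A

# Best rank-2r approximation of G (Eckart-Young), for comparison.
U, S, Vt = np.linalg.svd(G, full_matrices=False)
best_rank_2r = (U[:, :2 * r] * S[:2 * r]) @ Vt[:2 * r, :]

print(np.allclose(first_order_update, best_rank_2r))
```

The comparison relies on the orthonormality of the singular vectors: `B @ B.T @ G` recovers the top-r singular components of `G`, and `G @ A.T @ A` recovers the next r.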

Paper Structure

This paper contains 44 sections, 5 theorems, 21 equations, 5 figures, 8 tables, 2 algorithms.

Key Result

Lemma 3.1

Suppose the loss function is $\mathcal{L}$ and $y = W'x = (W_0 + \eta BA)x$, where $y$ is the output of a layer and $x$ is the input. Then the gradients of $A$ and $B$ are linear mappings of the gradient of $W'$. Remarkably, $\nabla_{W'} \mathcal{L}$ in LoRA and $\nabla_{W} \mathcal{L}$ in full fine-tuning are equal at the beginning of training.
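
The linear relations in the lemma can be checked numerically. The sketch below is illustrative and not taken from the paper: it assumes a squared-error loss, small random matrices, and the hypothetical helper names `loss`, `analytic_grads`, and `numeric_grad`. With $W' = W_0 + \eta BA$, the chain rule gives $\nabla_A \mathcal{L} = \eta B^\top \nabla_{W'} \mathcal{L}$ and $\nabla_B \mathcal{L} = \eta (\nabla_{W'} \mathcal{L}) A^\top$; the paper's own statement may absorb the factor $\eta$ differently. The last lines check the second claim for the vanilla LoRA case $B = 0$, where $W' = W_0$.

```python
import numpy as np

def loss(W0, A, B, eta, x, t):
    """Squared-error loss of one linear layer with a LoRA adapter (assumed loss)."""
    y = (W0 + eta * B @ A) @ x
    return 0.5 * float(np.sum((y - t) ** 2))

def analytic_grads(W0, A, B, eta, x, t):
    """Return (grad wrt W', grad wrt A, grad wrt B) from the chain rule."""
    y = (W0 + eta * B @ A) @ x
    G = np.outer(y - t, x)                 # gradient w.r.t. W'
    return G, eta * B.T @ G, eta * G @ A.T

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. entries of M (in place)."""
    g = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        old = M[idx]
        M[idx] = old + eps
        f_plus = f()
        M[idx] = old - eps
        f_minus = f()
        M[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d_out, d_in, r, eta = 4, 5, 2, 0.5
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)
t = rng.normal(size=d_out)

G, gA, gB = analytic_grads(W0, A, B, eta, x, t)
f = lambda: loss(W0, A, B, eta, x, t)
num_A = numeric_grad(f, A)
num_B = numeric_grad(f, B)
print(np.allclose(gA, num_A, atol=1e-5), np.allclose(gB, num_B, atol=1e-5))

# With B = 0 the adapted weight W' equals W0, so the gradient w.r.t. W'
# coincides with the full fine-tuning gradient w.r.t. W at W0.
G_lora_start, _, _ = analytic_grads(W0, A, np.zeros_like(B), eta, x, t)
G_full = np.outer(W0 @ x - t, x)
print(np.allclose(G_lora_start, G_full))
```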

Figures (5)

  • Figure 1: (Left) Training loss curves of Llama 2-7B on MetaMathQA versus training steps. LoRA-GA converges as quickly as full fine-tuning and outperforms LoRA. (Right) Initialization procedures used in LoRA and LoRA-GA. The key difference is that LoRA-GA initializes adapters using the eigenvectors of the gradient matrix, as opposed to random initialization with a scaling factor.
  • Figure 2: (Left) Training loss curves of LoRA-GA with different ranks on the MetaMathQA dataset. Higher ranks result in faster loss reduction, approaching the performance of full fine-tuning. (Right) Training loss curves from the ablation study with different settings on the MetaMathQA dataset. Compared to vanilla LoRA, both components of LoRA-GA, +SO (stable output) and +GA (gradient approximation), improve convergence speed. LoRA-GA achieves the fastest convergence, closely matching that of full fine-tuning.
  • Figure 3: Training loss curves of full fine-tuning, LoRA, and LoRA-GA on different datasets.
  • Figure 4: Training loss curves of different LoRA-GA ablations on different datasets.
  • Figure 5: (Left) A gradient matrix of T5-Base during fine-tuning on CoLA. (Middle) The decreasing curve of singular values of the gradient matrix. (Right) The cumulative curve showing the coverage of squared singular values.
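
The two curves described for Figure 5 can be computed for any matrix as follows. This is a minimal sketch on synthetic data, not the paper's plotting code: the helper name `squared_singular_coverage` and the random, approximately low-rank test matrix are assumptions made for illustration.

```python
import numpy as np

def squared_singular_coverage(G):
    """Return the singular values of G (descending) and the cumulative
    fraction of the total squared singular values covered by the top k."""
    s = np.linalg.svd(G, compute_uv=False)
    energy = s ** 2
    coverage = np.cumsum(energy) / energy.sum()
    return s, coverage

rng = np.random.default_rng(0)
# Synthetic stand-in for a gradient matrix: rank-4 signal plus small noise.
G = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 48))
G += 0.01 * rng.normal(size=G.shape)

s, coverage = squared_singular_coverage(G)
k = int(np.searchsorted(coverage, 0.99)) + 1  # smallest k with >= 99% coverage
print(k)
```

A coverage curve that saturates after a few components indicates that a low-rank factorization captures most of the matrix's squared Frobenius norm.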

Theorems & Definitions (11)

  • Lemma 3.1
  • Theorem 3.1
  • Definition 3.1
  • Theorem 3.2
  • proof
  • proof
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 1 more