ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

Hai Huang, Randall Balestriero

TL;DR

We prove that Dropout is only suitable for long training episodes and fails to converge to a reliable regularizer for short ones, and we propose an elegant solution: a Dropout-free, scaling-free LoRA with an Adaptive Learning rate, coined ALLoRA.

Abstract

Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations of LoRA for finetuning, a setting that employs a limited amount of data and few training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes and fails to converge to a reliable regularizer for short training episodes. Second, LoRA's initialization of $B$ at $0$ creates a slow training dynamic between $A$ and $B$. That dynamic is exacerbated by Dropout, which further slows the escape of $B$ from $0$; this is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates "short-sighted" interactions between the LoRA modules of different layers. Motivated by a principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free LoRA with an Adaptive Learning rate, coined ALLoRA. By scaling the per-sample and per-parameter gradients with a coefficient inversely proportional to the parameters' $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the Dropout rate. Empirical results show that ALLoRA achieves better accuracy than LoRA in various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show that our solution is optimal within a family of weight-dependent / output-dependent approaches on various LLMs, including the latest Llama3.
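
As a minimal sketch of the two ingredients named in the abstract, the NumPy snippet below applies a LoRA perturbation $W+AB$ and rescales a gradient by a coefficient inversely proportional to the $\ell_2$ norm of the parameter. The function names, the row-wise norm, and the `eps` offset that keeps the coefficient finite at $B=0$ are assumptions made for illustration; the abstract does not give the exact form of the coefficient.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Apply the adapted weight W + A B to a batch x of shape (n, d_in)."""
    return x @ (W + A @ B)

def norm_scaled_grad(grad, param, eps=1e-2):
    """Scale each row of grad by 1 / (||matching row of param||_2 + eps).

    Assumption: the row-wise norm and the eps offset are illustrative choices,
    not taken from the paper.
    """
    row_norms = np.linalg.norm(param, axis=1, keepdims=True)
    return grad / (row_norms + eps)

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 5, 2
W = rng.normal(size=(d_in, d_out))   # frozen pretrained weight
A = rng.normal(size=(d_in, r))       # trainable low-rank factor
B = np.zeros((r, d_out))             # LoRA initializes B at 0

x = rng.normal(size=(3, d_in))
print(np.allclose(lora_forward(x, W, A, B), x @ W))  # B = 0 leaves W unchanged

raw_grad = np.ones_like(B)
print(norm_scaled_grad(raw_grad, B)[0, 0])        # large coefficient at B = 0
print(norm_scaled_grad(raw_grad, B + 1.0)[0, 0])  # smaller once B has grown
```

Under these assumptions the effective step is largest while $B$ sits near $0$ and shrinks as its norm grows, which matches the qualitative behavior the abstract attributes to ALLoRA.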

Paper Structure

This paper contains 23 sections, 1 theorem, 16 equations, 6 figures, 7 tables.

Key Result

Proposition 1

(Ripple Effect) In the worst case, a constant scaling factor $\eta$ may cause the final output of a single forward pass of a LoRA finetuned model to grow exponentially w.r.t. the number of layers in the model.
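
To illustrate how a constant factor can compound with depth, the sketch below uses a toy stack of purely linear layers. The setup is an assumption made for illustration, not the setting of the proof: every layer has $W = I$ and a hypothetical worst-case perturbation $AB = I$ aligned with it, so each layer multiplies the signal by $1+\eta$ and the output norm is $(1+\eta)^L$ after $L$ layers.

```python
import numpy as np

def stacked_output_norm(num_layers, eta, dim=4):
    """Output norm of a unit-norm input after num_layers layers W + eta * AB."""
    x = np.ones(dim) / np.sqrt(dim)   # unit-norm input
    W = np.eye(dim)                   # toy pretrained weight (assumption)
    AB = np.eye(dim)                  # hypothetical worst case: aligned with W
    for _ in range(num_layers):
        x = (W + eta * AB) @ x
    return np.linalg.norm(x)

for depth in (4, 8, 16):
    print(depth, stacked_output_norm(depth, eta=1.0), (1 + 1.0) ** depth)
```

With $\eta = 0$ the stack reduces to the identity and the norm stays at 1; any fixed $\eta > 0$ in this toy case gives geometric growth in the number of layers.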

Figures (6)

  • Figure 1: We depict the absolute difference (y-axis) between the empirical and expected finetuning LoRA loss with varying Dropout rates (rows) on different datasets (columns) as a function of the number of Dropout realisations (x-axis). We observe that, regardless of the dataset and Dropout probability, the empirical error is a poor estimate of the true expected loss even after hundreds of averaged realisations. Hence, finetuning with Dropout produces a large amount of random noise that goes well beyond its regularization benefit, which only emerges after a large number of steps. That finding is also confirmed by the LLM experiment in \ref{fig:accuracy-epoch-dropout}.
  • Figure 2: Depiction of the distribution of the standard deviation of gradients (y-axis) w.r.t. the second layer of an MLP trained for MNIST (left) and CIFAR10 (right) classification, equipped with Dropout. At each training epoch (x-axis), we consider a single mini-batch and compute the gradients under numerous Dropout realisations. For each entry in the matrix of gradients, we compute the standard deviation and report the distribution over entries. We clearly see that while the average variance of the gradient decreases slightly during training, the tail significantly increases, leading to unstable training in finetuning regimes.
  • Figure 3: Depiction of the norm of ${\bm{A}}$ (left), the norm of ${\bm{B}}$ (middle) and the training loss (right) for an MNIST LoRA finetuning experiment. We see that, as training progresses (x-axis), an increased Dropout probability (colors) has a disproportionate regularization impact on ${\bm{B}}$ while barely impacting the norm of ${\bm{A}}$, indicating an asymmetry in Dropout's implicit regularization that makes LoRA slow to train.
  • Figure 4: Test set performance gap (y-axis) between the closed-form Dropout regularization and its empirical estimate as a function of training steps (x-axis). We observe that the benefit of Dropout as a regularizer falls short for finetuning (small number of training steps) compared to pretraining regimes (large number of training steps).
  • Figure 5: Left: LoRA with varying Dropout rates. A high Dropout rate provides the strongest performance after long finetuning and the weakest performance after short finetuning. Each line is an average of 3 runs. The x-axis is epochs and the y-axis is accuracy. Right: ALLoRA escapes from 0 rapidly, and then tapers off into a measured move. The starting phase matches that of LoRA with a much higher learning rate. LoRA with a lower learning rate can reach the same level of $L^2$ norm, but much more slowly. This finding echoes \ref{fig:convergence_lora}, which showed that the noise induced by Dropout does not converge until long training is employed.
  • ...and 1 more figure
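
The gap between an empirical Dropout loss and its expectation, as plotted in Figure 1, can be reproduced in miniature. The sketch below is an assumption-laden toy, not the paper's experiment: it uses a single linear unit with squared loss and inverted Dropout on the inputs, for which the expected loss has the well-known closed form $(y - w^\top x)^2 + \frac{p}{1-p}\sum_i w_i^2 x_i^2$. All names and sizes are hypothetical.

```python
import numpy as np

def expected_dropout_loss(w, x, y, p):
    """Closed-form E[(y - w.(m * x) / (1 - p))^2] over Bernoulli(1 - p) masks m."""
    return (y - w @ x) ** 2 + p / (1.0 - p) * np.sum((w * x) ** 2)

def empirical_dropout_loss(w, x, y, p, k, rng):
    """Average of the squared loss over k sampled Dropout masks."""
    masks = rng.random((k, x.size)) >= p       # keep each input with prob. 1 - p
    preds = (masks * x / (1.0 - p)) @ w        # inverted-Dropout rescaling
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
w, x, y, p = rng.normal(size=16), rng.normal(size=16), 1.0, 0.5
target = expected_dropout_loss(w, x, y, p)
for k in (1, 10, 100, 1000, 10000):
    gap = abs(empirical_dropout_loss(w, x, y, p, k, rng) - target)
    print(f"k={k:>5}  |empirical - expected| = {gap:.4f}")
```

Because the Monte Carlo error shrinks only at a rate of roughly $1/\sqrt{k}$, a handful of realisations, as in a short finetuning run, can leave the empirical loss far from the regularized objective it is meant to approximate.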

Theorems & Definitions (4)

  • Definition 1
  • Proposition 1
  • Definition 2
  • Proof