The Impact of Initialization on LoRA Finetuning Dynamics

Soufiane Hayou, Nikhil Ghosh, Bin Yu

TL;DR

This work shows that two natural random initialization schemes for LoRA adapters, Init[A] (A random, B=0) and Init[B] (B random, A=0), produce fundamentally different finetuning dynamics in the large-width limit. Combining a large-width γ-operator framework with a simplified single-LoRA-module setting, it shows that Init[A] permits larger stable learning rates and more efficient feature learning at the cost of internal instability, while Init[B] remains stable but undertrains the B matrix, yielding suboptimal learning. The authors validate these predictions with teacher-student simulations and extensive experiments on RoBERTa and Llama models, demonstrating consistent empirical advantages of Init[A] on many tasks while acknowledging that suboptimalities remain. The findings offer zero-cost, practically actionable guidance: prefer Init[A] for LoRA finetuning. They also motivate combining LoRA with complementary efficiency methods to mitigate the instability and undertraining concerns.

Abstract

In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (the default initialization in the PEFT package), or vice versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning start from the pretrained model. These two initialization schemes are seemingly similar, and in principle they should yield the same performance and share the same optimal learning rate. We demonstrate that this intuition is incorrect and that the first scheme (initializing B to zero and A to random) on average yields better performance than the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) than the second initialization, resulting in more efficient learning with the first scheme. We validate our results with extensive experiments on LLMs.
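
The two schemes described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names `init_lora`, `matmul`, and `max_abs` are hypothetical, the shapes follow the usual LoRA convention (A is rank x n_in, B is n_out x rank), and the 1/sqrt(fan-in) standard deviation is an assumption made only for the example. The point it shows is the one stated above: under either scheme the product BA is exactly zero, so finetuning starts from the pretrained weights.

```python
import random


def init_lora(n_in, n_out, rank, scheme, seed=0):
    """Return (A, B) with A of shape rank x n_in and B of shape n_out x rank.

    scheme="A": A random, B zero (Init[A]).
    scheme="B": B random, A zero (Init[B]).
    The 1/sqrt(fan_in) standard deviation is an illustrative assumption.
    """
    rng = random.Random(seed)

    def gaussian(rows, cols, std):
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]

    if scheme == "A":
        return gaussian(rank, n_in, n_in ** -0.5), zeros(n_out, rank)
    if scheme == "B":
        return zeros(rank, n_in), gaussian(n_out, rank, rank ** -0.5)
    raise ValueError(f"unknown scheme: {scheme!r}")


def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def max_abs(M):
    """Largest absolute entry of a nested-list matrix."""
    return max(abs(v) for row in M for v in row)


for scheme in ("A", "B"):
    A, B = init_lora(n_in=16, n_out=16, rank=4, scheme=scheme)
    update = matmul(B, A)  # the low-rank update BA added to the frozen weight
    print(scheme, "max |BA| =", max_abs(update))
```

Both schemes report a zero update at initialization. The difference the paper studies only appears once training begins and the zero-initialized factor starts receiving gradient through the random one.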

Paper Structure

This paper contains 53 sections, 3 theorems, 25 equations, 6 figures, 12 tables.

Key Result

Lemma 1

For $t$ fixed, the asymptotic dynamics of $Z_A^t$ and $B_t$ follow the recursive formula

Figures (6)

  • Figure 1: Summary of our contributions in this paper: a description of the difference between the finetuning dynamics when LoRA weights $A$ and $B$ are initialized with Init[A] or Init[B].
  • Figure 2: Optimal learning rate for the finetuning of synthetic model \ref{eq:synthetic_model} with Init[A] and Init[B] as initialization. The optimal LRs are shown as a function of width $n$. Theoretical lines $n^{-1}$ and $n^{-1/2}$ are shown as well (constants $C_1, C_2$ are chosen to provide suitable trend visualization). As model width $n$ grows, the optimal learning rate with Init[A] becomes larger than the optimal learning rate with Init[B]. This is in agreement with the theoretical results.
  • Figure 3: Evolution of the norms of the $Z_A, Z_B$ features, averaged over training data. We compute the average $\widehat{\|Z_A\|} \overset{\text{def}}{=} N^{-1} \sum_{i=1}^N \|Z_A(x_i)\|$ (and the same for $Z_B$), where the $x_i$'s are the training data. The dynamics are shown for widths $n=128$ and $n=8192$, two seeds, and for both Init[A] and Init[B]. Train loss and the (optimal) learning rate are shown on top of each plot. We observe that the magnitude of $Z_A$ is significantly higher with Init[A] compared to Init[B] at large width ($n=8192$). Interestingly, the train loss is smaller with Init[A] than with Init[B]. Results with other seeds and widths are shown in \ref{app:add_exps}.
  • Figure 4: Test accuracy for RoBERTa-Large finetuned on GLUE tasks. The results are shown after convergence of finetuning with LoRA, initialized with either Init[A] or Init[B]. Models were finetuned using LoRA rank $r=8$ and FP16 precision. Optimal learning rate and corresponding accuracy are shown on top of each panel for both initializations. The experimental setup is provided in \ref{app:add_exps}.
  • Figure 5: (Left) Test perplexity (lower is better) of TinyLlama LoRA on WikiText-2 with Init[A] and Init[B]. (Center) MMLU accuracy of Llama-7b LoRA finetuned on the Flan-v2 dataset. (Right) GSM8k test accuracy of Llama-7b LoRA finetuned on the GSM8k dataset. More experimental details are provided in \ref{app:add_exps}.
  • ...and 1 more figure
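
Two quantities from the captions above are simple enough to sketch numerically: the theoretical reference lines $n^{-1/2}$ and $n^{-1}$ from Figure 2, and the averaged feature norm $N^{-1} \sum_{i=1}^N \|Z(x_i)\|$ from Figure 3. The snippet below is a hedged illustration, not the paper's plotting code. The function names are hypothetical, the constants `c1` and `c2` are placeholders for the caption's $C_1, C_2$ (whose actual values are not given here), and reading the $n^{-1/2}$ line as the Init[A] trend is an inference from the caption's statement that the optimal learning rate with Init[A] becomes the larger one as width grows.

```python
import math


def theoretical_lr_lines(widths, c1=1.0, c2=1.0):
    """Return (n, c1 * n^{-1/2}, c2 * n^{-1}) for each width n.

    c1 and c2 stand in for the caption's constants C_1, C_2; the values
    used here are placeholders, not the ones used in the paper.
    """
    return [(n, c1 * n ** -0.5, c2 / n) for n in widths]


def mean_feature_norm(features):
    """N^{-1} * sum_i ||Z(x_i)||, with one feature vector per training input."""
    norms = [math.sqrt(sum(v * v for v in z)) for z in features]
    return sum(norms) / len(norms)


for n, lr_sqrt, lr_inv in theoretical_lr_lines([128, 1024, 8192]):
    # With equal constants, the ratio of the two lines grows like sqrt(n).
    print(n, lr_sqrt, lr_inv, lr_sqrt / lr_inv)

# Two toy feature vectors with Euclidean norms 5 and 13.
print(mean_feature_norm([[3.0, 4.0], [5.0, 12.0]]))
```

With equal constants the gap between the two lines widens as the square root of the width, which is why the difference between the initializations is most visible at large $n$.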

Theorems & Definitions (11)

  • Definition 1: Low Rank Adapters (LoRA) from Hu et al. (2021)
  • Definition 2: LoRA Features
  • Definition 3: Feature Stability
  • Definition 4: Feature Learning
  • Definition 5: Efficient Learning with LoRA
  • Lemma 1: Informal
  • Theorem 1: Informal
  • Theorem 2: Informal
  • proof
  • proof
  • ...and 1 more