The Impact of Initialization on LoRA Finetuning Dynamics
Soufiane Hayou, Nikhil Ghosh, Bin Yu
TL;DR
This work shows that two natural random initialization schemes for LoRA adapters, Init[A] (A random, B=0) and Init[B] (B random, A=0), produce fundamentally different finetuning dynamics in the large-width limit. Combining a large-width γ-operator framework with a simplified single-LoRA-module setting, the analysis shows that Init[A] permits larger stable learning rates and more efficient feature learning at the cost of some internal instability, whereas Init[B] remains stable but undertrains the B matrix, which leads to suboptimal learning. The authors validate these predictions with teacher-student simulations and extensive experiments on RoBERTa and Llama models, which show a consistent empirical advantage for Init[A] on many tasks, while acknowledging that both schemes remain suboptimal. The findings offer zero-cost, actionable guidance: prefer Init[A] for LoRA finetuning. They also motivate combining LoRA with complementary efficiency methods to mitigate the instability and undertraining concerns.
Abstract
In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). To start finetuning from the pretrained model, one can either initialize B to zero and A to random values (the default initialization in the PEFT package), or vice versa. In both cases, the product BA is equal to zero at initialization, so finetuning starts from the pretrained model. The two schemes appear similar, and one might expect them to yield the same performance and share the same optimal learning rate. We demonstrate that this intuition is incorrect and that the first scheme (initializing B to zero and A to random values) yields better performance on average than the second. Our theoretical analysis suggests that the reason may be that the first initialization allows the use of larger learning rates (without causing output instability) than the second, resulting in more efficient learning with the first scheme. We validate our results with extensive experiments on LLMs.
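To make the two schemes concrete, the following is a minimal NumPy sketch of a single LoRA-adapted linear layer under each initialization. The function names (`init_lora`, `lora_forward`), the scaling factor `alpha`, and the Gaussian variances (1/d_in for A, 1/rank for B) are illustrative assumptions, not taken from the text above. The sketch only checks the property stated in the abstract: under either scheme, BA = 0 at initialization, so the adapted layer reproduces the pretrained output.

```python
import numpy as np


def init_lora(d_out, d_in, rank, scheme, seed=0):
    """Return (A, B) for one LoRA module under Init[A] or Init[B].

    The variances below are assumed for illustration only.
    """
    rng = np.random.default_rng(seed)
    if scheme == "A":
        # Init[A]: A random, B = 0
        A = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(rank, d_in))
        B = np.zeros((d_out, rank))
    elif scheme == "B":
        # Init[B]: B random, A = 0
        A = np.zeros((rank, d_in))
        B = rng.normal(0.0, 1.0 / np.sqrt(rank), size=(d_out, rank))
    else:
        raise ValueError("scheme must be 'A' or 'B'")
    return A, B


def lora_forward(W, A, B, x, alpha=1.0):
    """Pretrained output plus the low-rank update: W x + alpha * B (A x)."""
    return W @ x + alpha * (B @ (A @ x))


if __name__ == "__main__":
    d_out, d_in, rank = 8, 16, 2
    rng = np.random.default_rng(1)
    W = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight
    x = rng.normal(size=(d_in,))
    for scheme in ("A", "B"):
        A, B = init_lora(d_out, d_in, rank, scheme)
        same = np.allclose(lora_forward(W, A, B, x), W @ x)
        print(f"Init[{scheme}]: matches pretrained output at init = {same}")
```

Although both schemes start from the same function, exactly one of the two factors is nonzero in each case, which is where the asymmetry in the subsequent training dynamics originates.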
