Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization

Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, Rene Vidal

TL;DR

This work analyzes gradient-flow dynamics of LoRA-based fine-tuning for matrix factorization (MF), revealing a two-phase learning process: an alignment phase where LoRA singular directions align with the fine-tuning target, and a local convergence phase with linear decay. It provides rigorous results showing that smaller initialization scales yield a final error closer to optimal, and introduces a spectral initialization that enables convergence to arbitrary precision. The theory accounts for misalignment between pre-trained and fine-tuning tasks and the coupling with fixed pre-trained weights, and is corroborated by MF and image-classification experiments. The findings suggest initialization scale and spectral design crucially influence both optimization and generalization, with practical implications for efficient, accurate fine-tuning of large pre-trained models.

Abstract

Despite the empirical success of Low-Rank Adaptation (LoRA) in fine-tuning pre-trained models, there is little theoretical understanding of how first-order methods with carefully crafted initialization adapt models to new tasks. In this work, we take the first step towards bridging this gap by theoretically analyzing the learning dynamics of LoRA for matrix factorization (MF) under gradient flow (GF), emphasizing the crucial role of initialization. For small initialization, we theoretically show that GF converges to a neighborhood of the optimal solution, with smaller initialization leading to lower final error. Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix, and reducing the initialization scale improves alignment. To address this misalignment, we propose a spectral initialization for LoRA in MF and theoretically prove that GF with small spectral initialization converges to the fine-tuning task with arbitrary precision. Numerical experiments from MF and image classification validate our findings.
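
The plateau-then-convergence behavior described above can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration rather than the paper's exact setup: the toy objective $\tfrac{1}{2}\lVert Y_{\mathrm{ft}} - (W_{\mathrm{pre}} + BA)\rVert_F^2$ uses a single frozen matrix, gradient descent with a small step stands in for gradient flow, and the names `simulate_toy_lora` and `plateau_length` are hypothetical. Because this toy drops the coupling with fixed pre-trained layers, it is not expected to reproduce the dependence of the final error on the initialization scale; it only shows that a smaller scale $\alpha$ lengthens the initial alignment plateau.

```python
import numpy as np

def simulate_toy_lora(alpha, rank=1, steps=2000, lr=0.01, seed=0):
    """Gradient descent (an Euler discretization of gradient flow) on a toy
    LoRA objective: L(B, A) = 0.5 * ||Y_ft - (W_pre + B @ A)||_F^2.
    This objective is an illustrative assumption, not the paper's problem."""
    rng = np.random.default_rng(seed)
    n, m = 20, 15
    W_pre = rng.standard_normal((n, m))      # frozen "pre-trained" weights

    # Fine-tuning target: rank-one shift along the top singular pair of W_pre.
    U, _, Vt = np.linalg.svd(W_pre, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    Y_ft = W_pre + 5.0 * np.outer(u, v)

    # Small random initialization of both adapter factors, scaled by alpha.
    B = alpha * rng.standard_normal((n, rank))
    A = alpha * rng.standard_normal((rank, m))

    losses, alignments = [], []
    for _ in range(steps):
        R = W_pre + B @ A - Y_ft             # residual of the adapted model
        losses.append(0.5 * np.sum(R ** 2))
        # |cos| between u and the top left singular vector of the update B @ A.
        top_left = np.linalg.svd(B @ A, full_matrices=False)[0][:, 0]
        alignments.append(abs(u @ top_left))
        grad_B, grad_A = R @ A.T, B.T @ R    # gradients of L w.r.t. B and A
        B, A = B - lr * grad_B, A - lr * grad_A
    return np.array(losses), np.array(alignments)

def plateau_length(losses):
    """Index of the first step at which the loss falls below half its initial value."""
    return int(np.argmax(losses < 0.5 * losses[0]))

if __name__ == "__main__":
    for alpha in (1e-2, 1e-6):
        losses, align = simulate_toy_lora(alpha)
        print(alpha, plateau_length(losses), losses[-1], align[-1])
```

Running the script prints, for each scale, the plateau length, the last recorded loss, and the last recorded alignment. With a shared seed the two runs differ only in $\alpha$, so the gap in plateau length isolates the effect of the initialization scale in this toy.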

Paper Structure

This paper contains 39 sections, 26 theorems, 204 equations, 7 figures.

Key Result

Theorem 3.1

Assume $\delta_w\!\neq\!1$; without loss of generality take $\delta_w<1$. Then, for any LoRA rank $r$, there exist constants $c_1, c_2\!=\!\mathrm{polylog}(\frac{1}{\lvert1\!-\!\delta_w\rvert}, \sigma_{W_2}, \sigma_{W_1}),$ and $c_3, \alpha^*\!=\!\mathrm{polylog}(\lvert1\!-\!\delta_w\rvert, \sigm

Figures (7)

  • Figure 1: We simulate Problem \ref{eqn:obj_lora} in the context of $\delta_w\!<\!1$ using both small initialization (see §\ref{sec:prelim}) and small spectral initialization (see §\ref{sec:rank-r}). Each simulation is repeated thirty times, with shaded regions representing one standard deviation above and below the mean (see §\ref{subsec:simulation_mf} for details). The left panel shows the evolution of the loss for different initialization scales $\alpha$ with small and spectral initialization. The middle panel tracks the alignment quality between $\boldsymbol{U_{Z_1}^S}$ and $\boldsymbol{\gamma_1}$, measured by $\log_{10}(1 - \cos(\gamma_1, \boldsymbol{U_{Z_1}^S}(t)))$, where smaller values indicate better alignment. The right panel focuses on small initialization with $\alpha = 10^{-5}$, illustrating how the reconstruction loss, alignment between $\boldsymbol{U_{Z_1}^S}$ and $\boldsymbol{\gamma_1}$, and $\lVert Z_1 \rVert$ evolve during the alignment phase.
  • Figure 2: The left and middle panels report the loss and accuracy evaluated on the training and evaluation datasets for ResNet on the CIFAR-10 dataset. The right panel shows the evolution of the alignment between the singular matrices of the LoRA weights and the target directions (with smaller values indicating better alignment), as well as the norm of the LoRA weights. We repeat the simulation thirty times, with the shaded regions representing one standard deviation above and below the mean.
  • Figure 3: We simulate Problem \ref{eqn:obj_lora} in the context of $\delta_w\!<\!1$ using both small initialization (see §\ref{sec:prelim}) and small spectral initialization (see §\ref{sec:rank-r}). We generate the data $Y_{\mathrm{ft}} = Y_{\mathrm{pre}} + 5uv^\top$, where $u, v$ are the top principal components of $Y_{\mathrm{pre}}$. Each simulation is repeated thirty times, with shaded regions representing one standard deviation above and below the mean (see §\ref{subsec:simulation_mf} for details). The left column shows the evolution of the loss for different initialization scales $\alpha$ with small and spectral initialization. The middle column tracks the alignment quality between $\boldsymbol{U_{Z_1}^S}$ and $\gamma_1$, measured by $\log_{10}(1 - \cos(\gamma_1, \boldsymbol{U_{Z_1}^S}(t)))$, where smaller values indicate better alignment. The right column focuses on small initialization with $\alpha = 10^{-5}$, illustrating how the reconstruction loss, alignment between $\boldsymbol{U_{Z_1}^S}$ and $\gamma_1$, and $\lVert Z_1 \rVert$ evolve during the alignment phase.
  • Figure 4: We simulate Problem \ref{eqn:obj_lora} in the context of $\delta_w\!<\!1$ using both small initialization (see §\ref{sec:prelim}) and small spectral initialization (see §\ref{sec:rank-r}). We generate the data $Y_{\mathrm{ft}} = Y_{\mathrm{pre}} + 5uv^\top$, where $u, v$ are the bottom principal components of $Y_{\mathrm{pre}}$. Each simulation is repeated thirty times, with shaded regions representing one standard deviation above and below the mean (see §\ref{subsec:simulation_mf} for details). The left column shows the evolution of the loss for different initialization scales $\alpha$ with small and spectral initialization. The middle column tracks the alignment quality between $\boldsymbol{U_{Z_1}^S}$ and $\gamma_1$, measured by $\log_{10}(1 - \cos(\gamma_1, \boldsymbol{U_{Z_1}^S}(t)))$, where smaller values indicate better alignment. The right column focuses on small initialization with $\alpha = 10^{-5}$, illustrating how the reconstruction loss, alignment between $\boldsymbol{U_{Z_1}^S}$ and $\gamma_1$, and $\lVert Z_1 \rVert$ evolve during the alignment phase.
  • Figure 5: We fine-tune ResNet, ViT, and VGG models pre-trained on ImageNet on MNIST and CIFAR-10, and monitor the evolution of the alignment and norm of the LoRA weights in the early stage of training.
  • ...and 2 more figures
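
The alignment measure quoted in the captions, $\log_{10}(1 - \cos(\gamma_1, \boldsymbol{U_{Z_1}^S}(t)))$, can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name `alignment_log_gap` is hypothetical, the singular direction is taken to be the top left singular vector of the matrix passed in (the captions do not spell out how $\boldsymbol{U_{Z_1}^S}$ is constructed), the cosine is taken in absolute value to ignore the sign ambiguity of singular vectors, and a `floor` parameter is added so that perfect alignment does not produce $\log_{10}(0)$.

```python
import numpy as np

def alignment_log_gap(gamma, Z, floor=1e-16):
    """Return log10(1 - |cos(gamma, u_top)|), where u_top is the top left
    singular vector of Z. Smaller (more negative) values mean better alignment.
    Using the top left singular vector is an assumption made for illustration."""
    gamma = np.asarray(gamma, dtype=float)
    Z = np.asarray(Z, dtype=float)
    u_top = np.linalg.svd(Z, full_matrices=False)[0][:, 0]   # unit-norm column
    cos = abs(gamma @ u_top) / np.linalg.norm(gamma)
    # Clamp at `floor` so exact alignment maps to a finite value.
    return float(np.log10(max(1.0 - cos, floor)))

if __name__ == "__main__":
    e1 = np.array([1.0, 0.0, 0.0, 0.0])
    e2 = np.array([0.0, 1.0, 0.0, 0.0])
    w = np.array([1.0, 2.0])
    print(alignment_log_gap(e1, np.outer(e1, w)))  # aligned: strongly negative
    print(alignment_log_gap(e1, np.outer(e2, w)))  # orthogonal: close to 0
```

A value near 0 corresponds to orthogonal directions, and each additional unit of decrease corresponds to a tenfold reduction in $1 - \cos$, which is why the captions read smaller values as better alignment.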

Theorems & Definitions (52)

  • Theorem 3.1
  • Claim 3.1
  • Claim 3.2
  • Theorem 3.2: Local convergence
  • Remark 4.1
  • Theorem 4.1
  • Lemma A.1
  • proof
  • Lemma A.2
  • Lemma A.3: Singular Space Perturbation Bound
  • ...and 42 more