LoRA Training in the NTK Regime has No Spurious Local Minima

Uijeong Jang, Jason D. Lee, Ernest K. Ryu

TL;DR

This work analyzes LoRA fine-tuning of pretrained transformers within the NTK (lazy) regime, proving that full fine-tuning admits a low-rank solution of rank $r$ with $\frac{r(r+1)}{2}\le KN$ and that LoRA with rank $r$ satisfying $\frac{r(r+1)}{2}> KN$ eliminates spurious local minima, enabling gradient-based methods to find low-rank solutions. It further shows that the obtained low-rank LoRA solution generalizes well via Rademacher-based bounds, and validates the theory with experiments across NLP, image, and speech tasks, observing convergence to a global optimum and rank-dependent training dynamics. The results provide upper-bound guarantees on trainability and generalization, connecting LoRA’s practical effectiveness to NTK-based optimization geometry and matrix-factorization techniques. Overall, the paper offers a principled explanation for LoRA’s success and motivates future work on refined rank bounds, local analysis beyond NTK, and computational tradeoffs in rank selection.

Abstract

Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLMs), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank $r\lesssim \sqrt{N}$; (ii) using LoRA with rank $r\gtrsim \sqrt{N}$ eliminates spurious local minima, allowing gradient descent to find the low-rank solutions; (iii) the low-rank solution found using LoRA generalizes well.
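
As a rough illustration of the rank conditions stated in the TL;DR, the sketch below computes the largest rank $r$ with $\frac{r(r+1)}{2}\le KN$ and the smallest LoRA rank with $\frac{r(r+1)}{2}> KN$. The function names are hypothetical, and reading $K$ as the model's output dimension is an assumption, since this summary does not define it.

```python
import math


def max_rank_below_threshold(K: int, N: int) -> int:
    """Largest integer r with r(r+1)/2 <= K*N.

    K is assumed to be the output dimension and N the number of data points.
    r(r+1)/2 <= M is equivalent to (2r+1)^2 <= 8M+1, so an integer square
    root gives the answer without floating-point error.
    """
    M = K * N
    return (math.isqrt(8 * M + 1) - 1) // 2


def min_lora_rank(K: int, N: int) -> int:
    """Smallest integer r with r(r+1)/2 > K*N."""
    return max_rank_below_threshold(K, N) + 1


if __name__ == "__main__":
    K, N = 2, 1000  # illustrative values, not taken from the paper
    print(max_rank_below_threshold(K, N), min_lora_rank(K, N))
```

For the illustrative values $K=2$ and $N=1000$, the two thresholds are 62 and 63. Both grow like $\sqrt{2KN}$, which matches the $\sqrt{N}$ scaling in the abstract when $K$ is fixed.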

Paper Structure

This paper contains 35 sections, 19 theorems, 114 equations, 4 figures, 4 tables.

Key Result

Lemma 2.2

Let $r>0$. For $\boldsymbol{\delta}\in \mathbb{R}^{m\times n}$ such that $\mathrm{rank}(\boldsymbol{\delta})\leq r$,

Figures (4)

  • Figure 1: Geometric intuition of Theorem 3.1. The three-dimensional space describes the space of 2-by-2 matrices $\begin{bmatrix} 1 & x \\ y & z \end{bmatrix}$. The surface $z=xy$ represents the rank-$1$ matrices. The blue region on the surface corresponds to the region of smaller objective values, and the set of global minima is depicted in purple. (Left) Plot of \ref{eq:ex1} with $N=1$. The set of global minima is a plane, and its intersection with the surface $z=xy$ (a curve) is the set of rank-$1$ global minima. (Middle) Plot of \ref{eq:ex2} with $N=2$. The set of global minima is a line, and its intersection with the surface (two dots) is the set of rank-$1$ global minima. (Right) Plot of \ref{eq:ex3} with $N=3$. The set of global minima is a line, and there is no intersection with the surface, i.e., there is no rank-$1$ global minimum, although a rank-$2$ global minimum exists.
  • Figure 2: Training curves (training loss vs. epochs) on different NLP tasks.
  • Figure 3: Training curves (training loss vs. epochs) on image and speech classification tasks.
  • Figure 4: Test curves (accuracy vs. epochs) on different NLP tasks. We used the LoRA rank of 16.
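
The rank-1 surface described in the Figure 1 caption can be checked with a few lines of exact arithmetic. This is a minimal sketch that assumes the 2-by-2 matrices have the form [[1, x], [y, z]], which is consistent with the caption's statement that the surface $z=xy$ represents the rank-1 matrices. The helper name is hypothetical.

```python
from fractions import Fraction


def rank_of_figure_matrix(x, y, z) -> int:
    """Rank of the assumed matrix [[1, x], [y, z]].

    The top-left entry is 1, so the rank is at least 1. The rank is exactly 1
    when the determinant z - x*y is zero, and 2 otherwise.
    """
    x, y, z = Fraction(x), Fraction(y), Fraction(z)
    det = z - x * y
    return 1 if det == 0 else 2


if __name__ == "__main__":
    print(rank_of_figure_matrix(2, 3, 6))  # point on the surface z = xy
    print(rank_of_figure_matrix(2, 3, 7))  # point off the surface
```

Points with $z=xy$ lie on the surface and give rank 1. Any other point gives rank 2, which is the situation in the right panel, where the set of global minima misses the surface.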

Theorems & Definitions (34)

  • Lemma 2.2: Lemma 5.1 of Recht et al. (2010)
  • Theorem 2.3: Theorem 4.1 of Lee et al. (2016)
  • Theorem 2.4: Informal, Theorem 1 of Ge et al. (2015)
  • Theorem 3.1
  • Proof: Proof sketch of Theorem 3.1
  • Theorem 4.1
  • Corollary 4.2
  • Lemma 4.3
  • Lemma 4.4
  • proof
  • ...and 24 more