LoRA Training in the NTK Regime has No Spurious Local Minima

Uijeong Jang; Jason D. Lee; Ernest K. Ryu

TL;DR

This work analyzes LoRA fine-tuning of pretrained transformers in the NTK (lazy) regime. It proves that full fine-tuning admits a low-rank solution whose rank $r$ satisfies $\frac{r(r+1)}{2}\le KN$, and that LoRA with a rank $r$ satisfying $\frac{r(r+1)}{2}> KN$ eliminates spurious local minima, so gradient-based methods can find low-rank solutions. It further shows, via Rademacher-based bounds, that the low-rank solution obtained with LoRA generalizes well, and validates the theory with experiments on NLP, image, and speech tasks, observing convergence to a global optimum and rank-dependent training dynamics. The results provide upper-bound guarantees on trainability and generalization, connecting LoRA’s practical effectiveness to NTK-based optimization geometry and matrix-factorization techniques. Overall, the paper offers a principled explanation for LoRA’s success and motivates future work on refined rank bounds, local analysis beyond NTK, and computational tradeoffs in rank selection.
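The two rank conditions in the TL;DR can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and $K$ and $N$ are treated simply as positive integers (the abstract identifies $N$ as the number of data points; $K$ is not defined in this summary).

```python
def max_existence_rank(K: int, N: int) -> int:
    """Largest integer r with r(r+1)/2 <= K*N (the existence-side condition)."""
    r = 0
    # Increase r while the next candidate r+1 still satisfies (r+1)(r+2)/2 <= K*N.
    while (r + 1) * (r + 2) <= 2 * K * N:
        r += 1
    return r


def min_lora_rank(K: int, N: int) -> int:
    """Smallest integer r with r(r+1)/2 > K*N (the no-spurious-minima condition)."""
    return max_existence_rank(K, N) + 1


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    print(max_existence_rank(2, 100), min_lora_rank(2, 100))
```

Under these assumptions, both thresholds grow like $\sqrt{2KN}$, which matches the $\sqrt{N}$ scaling stated in the abstract when $K$ is held fixed.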

Abstract

Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLMs), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank $r\lesssim \sqrt{N}$; (ii) using LoRA with rank $r\gtrsim \sqrt{N}$ eliminates spurious local minima, allowing gradient descent to find the low-rank solutions; (iii) the low-rank solution found using LoRA generalizes well.
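To illustrate what "LoRA in the NTK regime" refers to, the sketch below uses the standard first-order Taylor model: the network output at weights $W_0 + \delta$ is approximated by $f(W_0) + \langle G, \delta\rangle$, where $G$ is the gradient of the output with respect to the weights, and LoRA restricts the update to $\delta = uv^\top$. This is a minimal sketch under those assumptions; the symbols `f0`, `grad`, `u`, `v` and all function names are hypothetical and are not taken from the paper.

```python
def matmul_uvT(u, v):
    """Return the m x n matrix u @ v.T for u (m x r) and v (n x r), as nested lists."""
    return [[sum(ui[k] * vj[k] for k in range(len(ui))) for vj in v] for ui in u]


def inner(a, b):
    """Frobenius inner product <a, b> of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def linearized_output(f0, grad, u, v):
    """First-order model: f(W0 + u v^T) is approximated by f0 + <grad, u v^T>."""
    return f0 + inner(grad, matmul_uvT(u, v))


def lora_grads(grad, u, v):
    """Gradients of <grad, u v^T> with respect to u (grad @ v) and v (grad.T @ u)."""
    m, n, r = len(u), len(v), len(u[0])
    du = [[sum(grad[i][j] * v[j][k] for j in range(n)) for k in range(r)] for i in range(m)]
    dv = [[sum(grad[i][j] * u[i][k] for i in range(m)) for k in range(r)] for j in range(n)]
    return du, dv


if __name__ == "__main__":
    # Toy 2 x 2 example with a rank-1 update; the values are arbitrary.
    G = [[1, 2], [3, 4]]
    u = [[1], [2]]
    v = [[3], [4]]
    print(linearized_output(0.5, G, u, v))
    print(lora_grads(G, u, v))
```

The linearized output is linear in $\delta$ but bilinear in $(u, v)$, which is why the factorized problem can be non-convex even though the problem in $\delta$ is not.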

Paper Structure (35 sections, 19 theorems, 114 equations, 4 figures, 4 tables)

This paper contains 35 sections, 19 theorems, 114 equations, 4 figures, 4 tables.

Introduction
Contribution.
Prior works
Theory of neural networks.
Neural tangent kernels.
Theory of transformers and LLMs.
PEFT methods and LoRA.
Matrix factorization.
Organization
Problem setting and preliminaries
Matrix notation.
Neural network.
Fine-tuning loss.
NTK regime.
LoRA.
...and 20 more sections

Key Result

Lemma 2.2

Let $r>0$. For $\boldsymbol{\delta}\in \mathbb{R}^{m\times n}$ such that $\mathrm{rank}(\boldsymbol{\delta})\leq r$,

Figures (4)

Figure 1: Geometric intuition of Theorem 3.1. The three-dimensional space represents the space of 2 by 2 matrices $\begin{bmatrix} 1 & x \\ y & z \end{bmatrix}$. The surface $z=xy$ represents the rank-$1$ matrices. The blue region on the surface corresponds to smaller objective values, and the set of global minima is depicted in purple. (Left) Plot of \ref{eq:ex1} with $N=1$. The set of global minima is a plane, and its intersection with the surface $z=xy$ (a curve) is the set of rank-$1$ global minima. (Middle) Plot of \ref{eq:ex2} with $N=2$. The set of global minima is a line, and its intersection with the surface (two dots) is the set of rank-$1$ global minima. (Right) Plot of \ref{eq:ex3} with $N=3$. The set of global minima is a line, and it does not intersect the surface, i.e., there is no rank-$1$ global minimum, but the problem admits a rank-$2$ global minimum.
Figure 2: Training curves (training loss vs. epochs) on different NLP tasks.
Figure 3: Training curves (training loss vs. epochs) on image and speech classification tasks.
Figure 4: Test curves (accuracy vs. epochs) on different NLP tasks. We used a LoRA rank of 16.
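The Figure 1 caption identifies the surface $z=xy$ with the rank-1 matrices. Assuming the 2 by 2 matrices there are parameterized as $\begin{bmatrix} 1 & x \\ y & z \end{bmatrix}$ (an assumption inferred from the surface equation), the determinant is $z - xy$, so the matrix has rank 1 exactly when $z = xy$ and rank 2 otherwise. The hypothetical helper below checks this for exact (integer or `Fraction`) inputs.

```python
from fractions import Fraction


def rank_with_unit_corner(x, y, z):
    """Rank of [[1, x], [y, z]], assuming exact arithmetic inputs.

    The top-left entry is 1, so the rank is at least 1.
    The rank is exactly 1 if and only if det = 1*z - x*y equals 0.
    """
    det = 1 * z - x * y
    return 1 if det == 0 else 2


if __name__ == "__main__":
    # A point on the surface z = x*y, then a point off it.
    print(rank_with_unit_corner(2, 3, 6))
    print(rank_with_unit_corner(Fraction(1, 2), 4, 3))
```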

Theorems & Definitions (34)

Lemma 2.2: Lemma 5.1 of Recht et al. (2010)
Theorem 2.3: Theorem 4.1 of Lee et al. (2016)
Theorem 2.4: Informal, Theorem 1 of Ge et al. (2015)
Theorem 3.1
proof: Proof sketch of Theorem 3.1
Theorem 4.1
Corollary 4.2
Lemma 4.3
Lemma 4.4
proof
...and 24 more
