Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Changmin Kang; Jihun Yun; Baekrok Shin; Yeseul Cho; Chulhee Yun

TL;DR

This work investigates why Muon-style optimizers paired with LoRA fine-tuning yield near-uniform growth of the LoRA adapter spectrum. The authors formulate a simplified LoRA-style matrix factorization and study its continuous-time spectral gradient flow (SpecGF) with a smoothed orthogonalization operator. They prove equal-rate dynamics: all singular values grow at the same rate up to small deviations, so smaller singular values reach their targets first, in contrast to the largest-first stepwise learning of standard gradient flow. They further show that SpecGF converges to global minima from almost all initializations, provided the factor norms remain bounded, and that $\ell_2$ regularization yields global convergence. Experiments in the same matrix-factorization setting corroborate the theory, while LLM fine-tuning supplies the motivating observation. The results show how orthogonalized updates can bias optimization toward uniform spectral growth, which may inform optimizer design for large-scale fine-tuning.
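
To see why equal rates imply that smaller singular values finish first, the following worked calculation may help. It is an idealized illustration and not the paper's theorem: it assumes perfectly aligned, balanced factors so that each singular value of the product is $\sigma_i = a_i^2$ with target $\sigma_i^\star$, a common initial scale $a_0$, and exact (unsmoothed) orthogonalization, which in this diagonal case reduces to a sign function. All symbols here ($a_i$, $a_0$, $\sigma_i^\star$, $t_i$) are introduced only for this sketch.

$$
\dot a_i(t) = \operatorname{sign}\!\left(\sigma_i^\star - a_i(t)^2\right) = 1 \quad \text{while } a_i(t)^2 < \sigma_i^\star,
\qquad\Longrightarrow\qquad
\sigma_i(t) = (a_0 + t)^2 .
$$

$$
\sigma_i(t_i) = \sigma_i^\star
\quad\Longleftrightarrow\quad
t_i = \sqrt{\sigma_i^\star} - a_0 .
$$

Under these assumptions every $\sigma_i$ follows the same curve $(a_0 + t)^2$ until it reaches its own target, and the hitting time $t_i$ is increasing in $\sigma_i^\star$, so the smallest target is attained first.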

Abstract

Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. Such optimizers often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
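
The contrast described in the abstract can be reproduced in a toy setting. The sketch below is a minimal illustration under stated assumptions, not the authors' experimental code: it uses discrete steps rather than continuous-time flow, exact SVD-based orthogonalization rather than the paper's smoothed operator $\mathcal{T}_\beta$, a diagonal target, and factors initialized as small multiples of the identity so that they start aligned with the target. The function names `orthogonalize` and `run`, the loss $\tfrac12\|\mathbf{A}\mathbf{B}-\mathbf{M}\|_F^2$, and all hyperparameters are choices made for this example.

```python
import numpy as np

def orthogonalize(G):
    """Return the orthogonal polar factor U @ Vt of G (exact, via SVD)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def run(M, steps, lr, eps=0.01, spectral=True):
    """Run SpecGD (spectral=True) or plain GD on 0.5 * ||A @ B - M||_F^2.

    Both factors start at eps * I (an assumption of this sketch).
    Returns the singular values of A @ B in descending order.
    """
    d = M.shape[0]
    A = eps * np.eye(d)
    B = eps * np.eye(d)
    for _ in range(steps):
        R = A @ B - M                 # residual
        gA, gB = R @ B.T, A.T @ R     # gradients w.r.t. A and B
        if spectral:
            # Orthogonalize each factor's gradient separately.
            gA, gB = orthogonalize(gA), orthogonalize(gB)
        A, B = A - lr * gA, B - lr * gB
    return np.linalg.svd(A @ B, compute_uv=False)

M = np.diag([3.0, 2.0, 1.0])          # target singular values 3, 2, 1
s_spec = run(M, steps=50, lr=0.01, spectral=True)
s_gd = run(M, steps=50, lr=0.01, spectral=False)
print("SpecGD singular values:", s_spec)
print("GD singular values:    ", s_gd)
```

In this aligned toy case the orthogonalized gradients are $-\mathbf{I}$ until a target is reached, so after 50 steps all three SpecGD singular values equal $(0.01 + 50 \cdot 0.01)^2 = 0.51^2$ regardless of their targets, whereas plain GD grows the component with the largest target fastest.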

Paper Structure (69 sections, 55 theorems, 285 equations, 17 figures)

Introduction
Related Works
Empirical Observation: Uniform Growth in LoRA with Muon
Experimental Setup.
LLM Fine-tuning Results.
Modeling for Theory.
Key Question.
Theoretical Setup
Notation.
Problem Setup
Uniform Growth of Singular Values
Alignment Yields Decoupled Dynamics
General case: Approximate as Near-Diagonal
Core Variables.
Key Concepts.
...and 54 more sections

Key Result

Theorem 5.1

Assume $\left\langle\mathbf{v}, \mathbf{w}\right\rangle \ne 0$ and decompose $\mathbf{w}$ into $\mathbf{v}$ and $\mathbf{z}$ with $\left\langle\mathbf{v}, \mathbf{z}\right\rangle = 0$. Then, for all $t \geq 0$, [...] If $\gamma > 0$ is sufficiently small, then under SpecGF with $\mathcal{T}_\beta$, [...] Moreover, as $\gamma \to 0$, both $|a(t) - b(t)|$ and $\lvert\dot a(t) - \dot b(t)\rvert$ vanish to 0 at [...]

Figures (17)

Figure 1: Evolution of the singular values of the LoRA $\mathbf{A}\mathbf{B}$ adapter applied to the query matrix in the first self-attention layer.
Figure 2: Comparison of singular value evolutions. While SpecGF induces uniform growth of the spectrum of $\mathbf{A}\mathbf{B}$, GF induces largest-first dynamics.
Figure 3: Loss comparison.
Figure 4: Comparison of SpecGF and vanilla GF on matrix factorization with a rank-$5$ target.
Figure 5: Lyapunov stability of global minima. Left: Loss trajectory. Right: Decrease of the distance from global minima. The basin of attraction of some global minima is large enough that, empirically, the origin lies within it.
...and 12 more figures

Theorems & Definitions (95)

Theorem 5.1: Informal
Lemma 5.2: Diagonal approximation
Lemma 5.3: Initial growth
Proposition 5.4: High-probability alignment
Lemma 5.5: Persistence of Non-degeneracy
Lemma 5.6: Square-root dynamics
Theorem 5.7: Uniform growth
Corollary 5.8: Smallest singular value learns first
Proposition 6.1: Analyticity of $\mathcal{T}$ and $\mathcal{T}_\beta$
Theorem 6.2
...and 85 more
