Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun

TL;DR

This work investigates why Muon-style optimizers paired with LoRA fine-tuning yield near-uniform growth of the LoRA adapter spectrum. By formulating a simplified LoRA-style matrix factorization and studying its continuous-time spectral gradient flow (SpecGF) with a smoothed orthogonalization operator, the authors prove equal-rate dynamics: all active singular values grow at nearly the same rate, so smaller singular values reach their targets first, in contrast to the largest-first behavior of standard gradient flow. They further show that SpecGF converges to global minima from almost all initializations as long as the factor norms remain bounded, with global convergence guaranteed under $\ell_2$ regularization, and they validate the theory with matrix-factorization experiments and LLM fine-tuning. The results illuminate how orthogonalized updates can bias optimization toward isotropic spectral growth, informing optimizer design for large-scale fine-tuning and potentially improving generalization in spectral-subspace settings.

Abstract

Spectral gradient descent (SpecGD) orthogonalizes matrix parameter updates and has inspired practical optimizers such as Muon. Such optimizers often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
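
To make the setting concrete, the following is a minimal NumPy sketch of a discrete SpecGD-style update on a two-factor matrix factorization, with each factor's gradient orthogonalized separately. It is an illustration under stated assumptions, not the authors' implementation: the squared-error loss $\tfrac12\lVert\mathbf{A}\mathbf{B}-\mathbf{M}\rVert_F^2$, the exact SVD-based orthogonalization (in place of the smoothed operator analyzed in the paper), and the names `orthogonalize`, `specgd_step`, and `run` with their default parameters are all hypothetical choices made here.

```python
import numpy as np


def orthogonalize(G):
    """Return U V^T from the thin SVD of G, i.e. G with its singular values set to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def specgd_step(A, B, M, lr):
    """One SpecGD-style step on the assumed loss 0.5 * ||A B - M||_F^2."""
    R = A @ B - M        # residual of the current product
    grad_A = R @ B.T     # gradient with respect to A
    grad_B = A.T @ R     # gradient with respect to B
    # Each factor's gradient is orthogonalized on its own, not the product's.
    return A - lr * orthogonalize(grad_A), B - lr * orthogonalize(grad_B)


def run(d=8, target=(3.0, 2.0, 1.0), steps=200, lr=1e-2, seed=0):
    """Track the top singular values of A B over training (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    r = len(target)
    # Build a rank-r target with prescribed singular values (illustrative only).
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    M = U @ np.diag(target) @ V.T
    # Small random initialization of both factors.
    A = 1e-3 * rng.standard_normal((d, r))
    B = 1e-3 * rng.standard_normal((r, d))
    history = []
    for _ in range(steps):
        A, B = specgd_step(A, B, M, lr)
        history.append(np.linalg.svd(A @ B, compute_uv=False)[:r])
    return np.array(history)  # shape (steps, r)
```

Plotting the columns of `run()` against the step index is one way to inspect how the singular values of the product evolve in this toy setup; replacing `orthogonalize(grad)` with the raw gradient gives a plain gradient-descent baseline for comparison.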

Paper Structure (69 sections, 55 theorems, 285 equations, 17 figures)

Key Result

Theorem 5.1

Assume $\left\langle\mathbf{v}, \mathbf{w}\right\rangle \ne 0$ and decompose $\mathbf{w}$ into $\mathbf{v}$ and $\mathbf{z}$ with $\left\langle\mathbf{v}, \mathbf{z}\right\rangle = 0$. Then, for all $t \geq 0$, ... If $\gamma > 0$ is sufficiently small, then under SpecGF with $\mathcal{T}_\beta$, ... Moreover, as $\gamma \to 0$, both $\lvert a(t) - b(t)\rvert$ and $\lvert\dot a(t) - \dot b(t)\rvert$ vanish at ...

Figures (17)

  • Figure 1: Evolution of the singular values of the LoRA $\mathbf{A}\mathbf{B}$ adapter applied to the query matrix in the first self-attention layer.
  • Figure 2: Comparison of singular value evolutions. While SpecGF induces uniform growth of the spectrum of $\mathbf{A}\mathbf{B}$, GF induces largest-first dynamics.
  • Figure 3: Loss comparison.
  • Figure 4: Comparison of SpecGF and vanilla GF on matrix factorization with a rank-$5$ target.
  • Figure 5: Lyapunov stability of global minima. Left: Loss trajectory. Right: Decrease in the distance from global minima. The basin of attraction of some global minima is large enough that, empirically, the origin lies inside it.
  • ...and 12 more figures

Theorems & Definitions (95)

  • Theorem 5.1: Informal
  • Lemma 5.2: Diagonal approximation
  • Lemma 5.3: Initial growth
  • Proposition 5.4: High-probability alignment
  • Lemma 5.5: Persistence of Non-degeneracy
  • Lemma 5.6: Square-root dynamics
  • Theorem 5.7: Uniform growth
  • Corollary 5.8: Smallest singular value learns first
  • Proposition 6.1: Analyticity of $\mathcal{T}$ and $\mathcal{T}_\beta$
  • Theorem 6.2
  • ...and 85 more