Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization
Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun
TL;DR
This work investigates why Muon-style optimizers paired with LoRA fine-tuning yield near-uniform growth of the LoRA adapter spectrum. By formulating a simplified LoRA-style matrix factorization and studying its continuous-time spectral gradient flow (SpecGF) with a smoothed orthogonalization operator, the authors prove equal-rate dynamics: all active singular values grow at the same rate up to small deviations, so smaller singular values reach their targets first, in contrast to the largest-first learning of standard gradient flow. They further show that SpecGF converges to global minima from almost all initializations, provided the factor norms remain bounded, and that global convergence holds under $\ell_2$ regularization. They support the theory with matrix-factorization experiments, alongside the LLM fine-tuning observations that motivated it. The results illuminate how orthogonalized updates can bias optimization toward isotropic spectral growth, informing optimizer design for large-scale fine-tuning and potentially improving generalization in spectral-subspace settings.
Abstract
Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. Such optimizers often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
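The equal-rate behavior described above can be illustrated with a minimal discrete-time sketch. This is not the authors' code or exact setting: it assumes a squared-error loss $\tfrac{1}{2}\|BA - M\|_F^2$, a hypothetical diagonal target `M`, a small identity initialization, and the exact polar factor from an SVD as the orthogonalization (rather than a smoothed operator or Muon's iterative approximation). The names `orthogonalize` and `specgd_factorization` are illustrative.

```python
import numpy as np


def orthogonalize(G):
    """Return U @ Vh from the reduced SVD of G, i.e. G with every singular value replaced by 1."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh


def specgd_factorization(M, rank, eps=0.1, lr=0.01, steps=50):
    """Spectral GD on L(B, A) = 0.5 * ||B @ A - M||_F^2 (an assumed loss).

    Each factor's gradient is orthogonalized separately before the step,
    mirroring how the two LoRA factors are treated independently.
    """
    m, n = M.shape
    # Assumed initialization: small scaled identity blocks.
    B = eps * np.eye(m, rank)
    A = eps * np.eye(rank, n)
    for _ in range(steps):
        R = B @ A - M          # residual of the product
        grad_B = R @ A.T       # dL/dB
        grad_A = B.T @ R       # dL/dA
        B = B - lr * orthogonalize(grad_B)
        A = A - lr * orthogonalize(grad_A)
    return B, A


# Hypothetical target with singular values 3, 2, 1.
M = np.diag([3.0, 2.0, 1.0])
B, A = specgd_factorization(M, rank=3)
singular_values = np.linalg.svd(B @ A, compute_uv=False)
print(singular_values)
```

In this diagonal example each orthogonalized gradient is a sign matrix, so every diagonal entry of both factors grows by `lr` per step. After 50 steps all three singular values of `B @ A` are therefore equal (about 0.36), even though their targets differ. Under plain gradient descent the components would instead grow at target-dependent rates.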
