On the Crucial Role of Initialization for Matrix Factorization
Bingcong Li, Liang Zhang, Aryan Mokhtari, Niao He
TL;DR
The paper shows that initialization is a decisive factor in nonconvex matrix factorization. It introduces Nyström initialization, which improves the convergence of ScaledGD from linear to quadratic in symmetric exact- and over-parametrized settings and enables fast, or even one-step, convergence in asymmetric cases. The same initialization is extended to LoRA, yielding NoRA and NoRA+ for more efficient and better-performing finetuning of large models on NLP and diffusion tasks. The theory establishes phase-based convergence improvements across parametrization regimes, and experiments report gains in few-shot learning, personalized image generation, commonsense reasoning, and math reasoning. In practice, Nyström-based initialization preserves the pretrained weights and avoids costly decompositions, which makes it attractive for scalable parameter-efficient finetuning.
Abstract
This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nyström initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nyström initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
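To make the symmetric setting concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the objective (1/4)·||X Xᵀ − A||_F² with A positive semidefinite of rank r, the standard ScaledGD update X ← X − η (X Xᵀ − A) X (Xᵀ X)⁻¹, and a Nyström-style sketch X₀ = AΩ with Gaussian Ω as the initialization. The function names, the step size 0.5, the problem sizes, and the eigenvalue range are illustrative choices made for this example only.

```python
import numpy as np


def nystrom_init(A, r, rng, scale=1.0):
    """Sketch-style initialization X0 = A @ Omega with Gaussian Omega (assumed form)."""
    omega = scale * rng.standard_normal((A.shape[1], r))
    return A @ omega


def scaled_gd_symmetric(A, X0, step=0.5, iters=50):
    """Run ScaledGD on (1/4) * ||X X^T - A||_F^2 and record relative errors."""
    X = X0.copy()
    norm_A = np.linalg.norm(A)
    errors = []
    for _ in range(iters):
        residual = X @ X.T - A
        errors.append(np.linalg.norm(residual) / norm_A)
        grad = residual @ X  # gradient of the objective w.r.t. X
        # Right-precondition by (X^T X)^{-1}; X^T X is symmetric, so a
        # linear solve on the transposed gradient avoids an explicit inverse.
        X = X - step * np.linalg.solve(X.T @ X, grad.T).T
    errors.append(np.linalg.norm(X @ X.T - A) / norm_A)
    return X, errors


def make_low_rank_psd(m, r, rng):
    """Hypothetical test matrix: rank-r PSD with eigenvalues in [1, 10]."""
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    eigvals = np.linspace(1.0, 10.0, r)
    return (U * eigvals) @ U.T


rng = np.random.default_rng(0)
A = make_low_rank_psd(50, 5, rng)
X0 = nystrom_init(A, 5, rng)
X, errors = scaled_gd_symmetric(A, X0)
print(f"relative error after {len(errors) - 1} steps: {errors[-1]:.2e}")
```

Because X₀ = AΩ lies in the column space of A and the update keeps it there, the iteration in this sketch reduces to a Newton-like recursion on an r × r factor, which is one way to see why a quadratic rate is plausible in the exact-parametrized case. Printing the `errors` list shows how quickly the residual shrinks once the iterate is close to a solution.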
