On the Crucial Role of Initialization for Matrix Factorization

Bingcong Li, Liang Zhang, Aryan Mokhtari, Niao He

TL;DR

The paper shows that initialization is a decisive factor in nonconvex matrix factorization, introducing Nyström initialization to drive ScaledGD from linear to quadratic convergence in symmetric exact/over-parametrized settings and enabling fast or even one-step convergence in asymmetric cases. It extends this initialization to LoRA, proposing NoRA and NoRA+ to improve efficiency and performance for finetuning large models across NLP and diffusion tasks. Theoretical results establish phase-based convergence improvements and flexibility across parametrization regimes, while empirical results demonstrate meaningful gains in few-shot learning, personalized image generation, commonsense reasoning, and math reasoning. Practically, Nyström-based initialization offers deployment-friendly benefits by preserving pretrained weights and avoiding costly decompositions, making it valuable for scalable parameter-efficient fine-tuning.

Abstract

This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nyström initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nyström initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.

Paper Structure

This paper contains 54 sections, 26 theorems, 76 equations, 7 figures, 10 tables.

Key Result

Lemma 1

For some universal constant $\tau > 0$, the bound $\sigma_r(\mathbf{X}_0) \geq \xi \tau (\sqrt{r_A} - \sqrt{r-1}) \sigma_{r_A}(\mathbf{A})$ holds with high probability; in particular, $\operatorname{rank}(\mathbf{X}_0) = r$ w.h.p.
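The quantities in Lemma 1 can be inspected numerically. The sketch below is illustrative only: it takes the initialization $\mathbf{X}_0 = \mathbf{A}\mathbf{\Omega}$ from the figure captions and assumes that $\mathbf{\Omega}$ has i.i.d. Gaussian entries with standard deviation $\xi$ and that $r_A = \operatorname{rank}(\mathbf{A})$. The names `nystrom_init` and `random_psd` are hypothetical helpers, not taken from the paper. Since the constant $\tau$ is not specified, the code only reports the empirical ratio $\sigma_r(\mathbf{X}_0) / \big(\xi (\sqrt{r_A} - \sqrt{r-1}) \sigma_{r_A}(\mathbf{A})\big)$ over random draws.

```python
import numpy as np


def nystrom_init(A, r, xi=1.0, rng=None):
    """Return X0 = A @ Omega, with Omega an (n x r) Gaussian matrix of std xi.

    The Gaussian choice for Omega is an assumption made for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    Omega = xi * rng.standard_normal((A.shape[1], r))
    return A @ Omega


def random_psd(n, r_A, rng):
    """Hypothetical helper: random PSD matrix of rank r_A, eigenvalues in [1, 2]."""
    U, _ = np.linalg.qr(rng.standard_normal((n, r_A)))
    eigs = np.linspace(2.0, 1.0, r_A)
    return (U * eigs) @ U.T  # U diag(eigs) U^T


rng = np.random.default_rng(0)
n, r_A, r, xi = 50, 8, 5, 0.1  # under-parametrized case: r <= r_A
A = random_psd(n, r_A, rng)
sigma_rA = np.linalg.svd(A, compute_uv=False)[r_A - 1]

ranks, ratios = [], []
for _ in range(200):
    X0 = nystrom_init(A, r, xi, rng)
    sigma_r = np.linalg.svd(X0, compute_uv=False)[r - 1]
    ranks.append(np.linalg.matrix_rank(X0))
    # Ratio of sigma_r(X0) to the lemma's right-hand side with tau set to 1.
    ratios.append(sigma_r / (xi * (np.sqrt(r_A) - np.sqrt(r - 1)) * sigma_rA))

print("all draws have rank r:", all(k == r for k in ranks))
print("smallest observed ratio:", min(ratios))
```

The smallest observed ratio gives a rough empirical feel for how small $\tau$ would need to be on this particular instance; it is not an estimate of the universal constant in the lemma.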

Figures (7)

  • Figure 1: Convergence of ScaledGD under Nyström initialization (optimality error vs. iteration) in different settings. (a) Comparison of GD and ScaledGD with small initialization against ScaledGD with Nyström initialization (ours). (b) Solid lines show that our initialization is not sensitive to the magnitude of $\xi$; dotted lines illustrate that quadratic convergence cannot be obtained after perturbing the initialization, i.e., $\mathbf{X}_0 = \mathbf{A} \mathbf{\Omega} + \mathbf{N}$, where $[\mathbf{N}]_{ij} \sim \mathcal{N}(0, \xi_n^2)$. (c) Comparison of ScaledGD under Nyström initialization with various $\eta$.
  • Figure 2: Which singular values have the largest change after finetuning with LoRA of rank $r$? Orange: top-$r$ singular values; blue: other singular values. Note that here we only plot the first 64 singular values as others rarely have sufficiently large change.
  • Figure 3: Generated images from NoRA and NoRA+ with stable-diffusion.
  • Figure 4: Convergence of ScaledGD under Nyström initialization (optimality error vs. iteration) on over-parametrized problems detailed in the appendix. (a) Comparison of GD, ScaledGD-$(\lambda)$ with small initialization, and ScaledGD with our initialization. (b) Solid lines show that our initialization is not sensitive to magnitude; dotted lines illustrate that quadratic convergence cannot be obtained even with slightly perturbed initialization, i.e., $\mathbf{X}_0 = \mathbf{A} \mathbf{\Omega} + \mathbf{N}$, where $[\mathbf{N}]_{ij} \sim \mathcal{N}(0, \xi_n^2)$.
  • Figure 5: The dog dataset.
  • ...and 2 more figures
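
The kind of experiment summarized in Figure 1 can be sketched in a few lines for the symmetric, exactly parametrized case. This is a minimal illustration under stated assumptions, not the authors' code: the objective is taken to be $\tfrac{1}{4}\|\mathbf{X}\mathbf{X}^\top - \mathbf{A}\|_F^2$, the ScaledGD update is the standard preconditioned step $\mathbf{X}_{t+1} = \mathbf{X}_t - \eta (\mathbf{X}_t\mathbf{X}_t^\top - \mathbf{A}) \mathbf{X}_t (\mathbf{X}_t^\top \mathbf{X}_t)^{-1}$, $\mathbf{\Omega}$ is assumed Gaussian with standard deviation $\xi$, and the step size $\eta = 0.5$ is a choice made for this sketch. The function names are hypothetical.

```python
import numpy as np


def nystrom_init(A, r, xi=1.0, rng=None):
    """X0 = A @ Omega with Gaussian Omega of std xi (an assumption of this sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    return A @ (xi * rng.standard_normal((A.shape[1], r)))


def scaled_gd(A, X0, eta=0.5, iters=40):
    """Run ScaledGD on 0.25 * ||X X^T - A||_F^2 and record the optimality error."""
    X = X0.copy()
    errors = []
    for _ in range(iters):
        errors.append(np.linalg.norm(X @ X.T - A))
        grad = (X @ X.T - A) @ X
        # Right-multiply grad by (X^T X)^{-1}; X^T X is symmetric, so
        # solve(X^T X, grad^T)^T equals grad @ inv(X^T X).
        X = X - eta * np.linalg.solve(X.T @ X, grad.T).T
    errors.append(np.linalg.norm(X @ X.T - A))
    return X, errors


# Synthetic rank-r PSD target (hypothetical test instance).
rng = np.random.default_rng(0)
n, r = 20, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = (U * np.linspace(2.0, 1.0, r)) @ U.T

X0 = nystrom_init(A, r, xi=1.0, rng=rng)
X, errors = scaled_gd(A, X0)
for t, e in enumerate(errors[:12]):
    print(f"iter {t:2d}  ||X X^T - A||_F = {e:.3e}")
```

Printing the error per iteration makes the convergence phases visible: once the error is small, it should roughly square from one step to the next until it reaches floating-point precision. Replacing `X0` with `A @ Omega + N` for a small Gaussian `N` gives a quick way to probe the perturbation effect described in panel (b).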

Theorems & Definitions (53)

  • Lemma 1: Initialization for exact- and under-parametrization
  • Lemma 2
  • Theorem 1
  • Definition 1: Weak optimality
  • Lemma 3
  • Lemma 4
  • Theorem 2
  • Lemma 5
  • Lemma 6
  • Theorem 3: One-step convergence
  • ...and 43 more