Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks

Zhenghao Xu, Yuqing Wang, Tuo Zhao, Rachel Ward, Molei Tao

TL;DR

This work studies the convergence behavior of gradient descent (GD) and Nesterov's accelerated gradient (NAG) for rectangular matrix factorization and its extension to linear neural networks. Using an unbalanced initialization where $\mathbf{X}_0$ is large and $\mathbf{Y}_0=0$, the authors develop a contraction-subspace framework and prove that, with high probability, GD reaches a relative error of $\epsilon$ in $O(d^2(d-r+1)^{-2}\kappa^2\log(1/\epsilon))$ iterations, while NAG needs only $O(d(d-r+1)^{-1}\kappa\log(1/\epsilon))$ iterations. They further extend the analysis to two-layer linear networks under an interpolation assumption, showing that NAG attains accelerated linear convergence with comparatively modest width requirements. The results establish provable acceleration of NAG over GD for these nonconvex factorization problems, and numerical experiments corroborate the theory.

Abstract

We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that gradient descent (GD) can find a pair of $\epsilon$-optimal solutions $\mathbf{X}_T\in\mathbb{R}^{m\times d}$ and $\mathbf{Y}_T\in\mathbb{R}^{n\times d}$, where $d\geq r$, satisfying $\lVert\mathbf{X}_T\mathbf{Y}_T^\top-\mathbf{A}\rVert_\mathrm{F}\leq\epsilon\lVert\mathbf{A}\rVert_\mathrm{F}$ in $T=O(\kappa^2\log\frac{1}{\epsilon})$ iterations with high probability, where $\kappa$ denotes the condition number of $\mathbf{A}$. Furthermore, we prove that Nesterov's accelerated gradient (NAG) attains an iteration complexity of $O(\kappa\log\frac{1}{\epsilon})$, which is the best-known bound of first-order methods for rectangular matrix factorization. Unlike the small balanced random initialization used in the existing literature, we adopt an unbalanced initialization, where $\mathbf{X}_0$ is large and $\mathbf{Y}_0$ is $0$. Moreover, our initialization and analysis can be further extended to linear neural networks, where we prove that NAG can also attain an accelerated linear convergence rate. In particular, we only require the width of the network to be greater than or equal to the rank of the output label matrix. In contrast, previous results achieving the same rate require excessive widths that additionally depend on the condition number and the rank of the input data matrix.
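The setup described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes the objective $\frac{1}{2}\lVert\mathbf{X}\mathbf{Y}^\top-\mathbf{A}\rVert_\mathrm{F}^2$, assumes the "large" $\mathbf{X}_0$ takes the form $c\,\mathbf{A}\mathbf{\Phi}$ with a Gaussian $\mathbf{\Phi}$ (the text above only states that $\mathbf{X}_0$ is large and $\mathbf{Y}_0=0$), and uses textbook NAG parameters $\eta=1/L$, $\beta=(\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$. The problem sizes, the scale `c`, and all function names are hypothetical.

```python
import numpy as np

def make_problem(m=30, n=20, r=3, kappa=2.0, seed=0):
    """Build a rank-r matrix A whose singular values run from 1 to 1/kappa."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    s = np.linspace(1.0, 1.0 / kappa, r)
    return (U * s) @ V.T

def unbalanced_init(A, d, c, seed=1):
    """ASSUMPTION: X0 = c * A @ Phi with Gaussian Phi; Y0 = 0 as in the abstract."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((A.shape[1], d)) / np.sqrt(d)
    return c * A @ Phi, np.zeros((A.shape[1], d))

def grads(X, Y, A):
    """Gradients of 0.5 * ||X Y^T - A||_F^2 with respect to X and Y."""
    R = X @ Y.T - A
    return R @ Y, R.T @ X

def rel_err(X, Y, A):
    return np.linalg.norm(X @ Y.T - A) / np.linalg.norm(A)

def run_gd(A, X, Y, eta, T):
    for _ in range(T):
        gX, gY = grads(X, Y, A)
        X, Y = X - eta * gX, Y - eta * gY
    return X, Y

def run_nag(A, X, Y, eta, beta, T):
    """Standard NAG: extrapolate with momentum, then take a gradient step."""
    Xp, Yp = X, Y
    for _ in range(T):
        Xe, Ye = X + beta * (X - Xp), Y + beta * (Y - Yp)
        gX, gY = grads(Xe, Ye, A)
        Xp, Yp = X, Y
        X, Y = Xe - eta * gX, Ye - eta * gY
    return X, Y

def demo(r=3, d=15, c=20.0, T=1000):
    A = make_problem(r=r)
    X0, Y0 = unbalanced_init(A, d, c)
    sv = np.linalg.svd(X0, compute_uv=False)
    L, mu = sv[0] ** 2, sv[r - 1] ** 2          # L and mu as in Theorem 1
    Xg, Yg = run_gd(A, X0, Y0, 2.0 / (L + mu), T)
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # textbook choice
    Xn, Yn = run_nag(A, X0, Y0, 1.0 / L, beta, T)
    return rel_err(Xg, Yg, A), rel_err(Xn, Yn, A)

if __name__ == "__main__":
    gd_err, nag_err = demo()
    print(f"GD relative error:  {gd_err:.2e}")
    print(f"NAG relative error: {nag_err:.2e}")
```

Because $\mathbf{Y}_0=0$, the first gradient with respect to $\mathbf{X}$ vanishes, and for large `c` the iterate $\mathbf{X}_t$ moves very little, so the problem behaves roughly like a least-squares problem in $\mathbf{Y}$. Logging `rel_err` per iteration and raising `kappa` is one way to compare the two methods' rates empirically.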

Paper Structure

This paper contains 29 sections, 21 theorems, 130 equations, 6 figures, 1 table.

Key Result

Theorem 1

For $0<\tau<c_1$, denote $\delta=3e^{-(d-r+1)\cdot\min\{\log\frac{1}{c_1\tau}, c_2, \frac{1}{2}\}}$, where $c_1$ and $c_2$ are universal constants. Denote $L=\sigma_1^2(\mathbf{X}_0)$, $\mu=\sigma_r^2(\mathbf{X}_0)$. Let $\eta=\frac{2}{L+\mu}$, $c\geq \underline{c}\coloneqq\frac{\sqrt{d}\,\sigma_r(\cdots)\cdots}{\cdots}$ … In particular, if $c=\underline{c}$, then GD finds $\left\lVert\mathbf{R}_T\right\rVert_\mathrm{F}$ …
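The step size $\eta=\frac{2}{L+\mu}$ matches the classical choice for a quadratic whose curvature lies in $[\mu, L]$. As a sketch of why, using only a standard fact about symmetric matrices (here $\mathbf{H}$ is a hypothetical stand-in for the relevant curvature matrix, not a quantity defined in the text above):

```latex
\[
\mu\mathbf{I}\preceq\mathbf{H}\preceq L\mathbf{I}
\;\Longrightarrow\;
\lVert\mathbf{I}-\eta\mathbf{H}\rVert_2
\le\max\bigl\{\lvert 1-\eta\mu\rvert,\ \lvert 1-\eta L\rvert\bigr\}
=\frac{L-\mu}{L+\mu}
\quad\text{at }\eta=\frac{2}{L+\mu}.
\]
```

If $L/\mu$ scales like $\kappa^2$ up to dimension-dependent factors, this per-step factor is $1-\Theta(\kappa^{-2})$, which is consistent with the $\kappa^2\log\frac{1}{\epsilon}$ iteration count stated for GD.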

Figures (6)

  • Figure 1: GD and AltGD achieve similar performance. The left plot is for matrix factorization (MF), and the right plot is for linear neural networks (LNN).
  • Figure 2: NAG converges faster than GD. The left plot is for matrix factorization (MF), and the right plot is for linear neural networks (LNN).
  • Figure 3: Comparison of predicted loss and numerical loss for matrix factorization. The left plot is for GD where $\kappa=10$, and the right plot is for GD and NAG where $\kappa=100$. (T) denotes theory prediction.
  • Figure 4: GD and NAG on large matrices exhibit similar behavior to small matrices in Figure 2. Left: matrix factorization with $m=1200$ and $n=1000$. Right: linear neural networks with $m=500$, $n=400$, $N=600$.
  • Figure 5: GD and NAG with different values of $c$. When $c$ is sufficiently large, changing its value would not significantly affect the convergence rate.
  • ...and 1 more figure

Theorems & Definitions (43)

  • Theorem 1: GD convergence rate
  • Theorem 2: NAG convergence rate
  • Proposition 1
  • Proposition 2: GD dynamics
  • Lemma 1: Eigensubspace
  • Lemma 2: GD contractivity
  • Lemma 3: Nonlinear error
  • Remark 1
  • Proposition 3: NAG dynamics
  • Lemma 4: NAG contractivity
  • ...and 33 more