$k$-SVD with Gradient Descent

Yassir Jedra, Devavrat Shah

TL;DR

This work develops a gradient-descent method for computing the leading $k$-SVD of a matrix $M$ of rank $d$, using a simple, parameter-free step-size and random initialization. The authors prove global linear convergence by showing that the iterates enter an attracting region where the dynamics emulate Heron’s method for the top singular value, and extend the approach to sequentially recover $\sigma_1,\dots,\sigma_k$ and $u_1,\dots,u_k$, including in the under-parameterized case ($k < d$). They further introduce acceleration via Nesterov’s method to improve rates and demonstrate favorable empirical performance on synthetic and real-data matrices, with runtimes competitive with Lanczos-based approaches. The results offer a scalable, robust alternative for large-scale $k$-SVD and deepen understanding of gradient-based methods for nonconvex matrix factorization, including the role of preconditioning and geometric regions of attraction.

Abstract

The emergence of modern compute infrastructure for iterative optimization has led to great interest in developing optimization-based approaches for scalable computation of $k$-SVD, i.e., the $k\geq 1$ largest singular values and corresponding vectors of a matrix of rank $d \geq 1$. Despite much exciting recent work, all prior approaches fall short in this pursuit. Specifically, existing results either cover only the exact-parameterized (i.e., $k = d$) and over-parameterized (i.e., $k > d$) settings, or establish only local convergence guarantees, or use a step-size that requires problem-instance-specific, oracle-provided information. In this work, we complete this pursuit by providing a gradient-descent method with a simple, universal rule for step-size selection (akin to pre-conditioning) that provably finds the $k$-SVD of a matrix of any rank $d \geq 1$. We establish that the gradient method with random initialization enjoys global linear convergence for any $k, d \geq 1$. Our convergence analysis reveals that the gradient method has an attractive region, and that within this region the method behaves like Heron's method (a.k.a. the Babylonian method). Our analytic results about this attractive region imply that the gradient method can be enhanced by means of Nesterov's momentum-based acceleration technique. The resulting improved convergence rates match those of rather complicated methods that typically rely on Lanczos iterations or variants thereof. We provide an empirical study to validate the theoretical results.
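
The link to Heron's method can be made concrete with a short worked derivation. This is a sketch under assumptions that this summary does not state: the objective is taken to be $g(x) = \tfrac{1}{4}\Vert M - xx^\top \Vert_F^2$ for symmetric $M$, and the "universal" step-size is taken to be $\eta / \Vert x_t \Vert^2$. Under those assumptions the update and its scalar special case read:

$$
\begin{aligned}
\nabla g(x) &= \Vert x \Vert^2 x - M x, \\
x_{t+1} &= x_t - \frac{\eta}{\Vert x_t \Vert^2}\,\nabla g(x_t) = (1-\eta)\,x_t + \eta\,\frac{M x_t}{\Vert x_t \Vert^2}, \\
x_{t+1} &= \frac{1}{2}\left( x_t + \frac{\sigma}{x_t} \right) \quad \text{for } n = 1,\ M = \sigma,\ \eta = \tfrac{1}{2}.
\end{aligned}
$$

The last line is the classical Heron (Babylonian) iteration for $\sqrt{\sigma}$, so in this scalar case $x_t^2 \to \sigma$, which is consistent with the quantity $\Vert x_t \Vert^2$ converging to $\sigma_1$.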

Paper Structure

This paper contains 28 sections, 31 theorems, 187 equations, 1 figure, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $\epsilon > 0$ and let $M \in \mathbb{R}^{n\times n}$ be a symmetric, positive semi-definite matrix with $\sigma_1 - \sigma_2 > 0$. Running the gradient descent iterations described in eq:gradient with the choice $\eta=1/2$ ensures that for $t \ge 1$, $\vert \Vert x_t \Vert^2 - \sigma_1 \vert \le \epsilon$ ... where $c_1, c_2$ are constants that depend only on the initial point $x_0$; with the random initial ...
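
The behavior described in Theorem 1 can be illustrated with a minimal sketch. The update rule is not reproduced in this summary, so the code assumes the iteration $x_{t+1} = (1-\eta)\,x_t + \eta\, M x_t / \Vert x_t \Vert^2$, i.e. gradient descent on $\tfrac{1}{4}\Vert M - xx^\top \Vert_F^2$ with step-size $\eta / \Vert x_t \Vert^2$. The names `matvec` and `gd_top_singular`, the iteration count, and the test matrix are hypothetical choices for illustration, not the authors' GDSVD implementation.

```python
import math
import random


def matvec(M, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def gd_top_singular(M, eta=0.5, iters=200, seed=0):
    """Estimate (sigma_1, u_1) of a symmetric PSD matrix M.

    Assumed update (not taken from the summary):
        x <- (1 - eta) * x + eta * M x / ||x||^2
    Returns ||x||^2 as the estimate of sigma_1 and x / ||x|| as u_1.
    """
    rng = random.Random(seed)
    # Random Gaussian initialization, as in the random-init setting.
    x = [rng.gauss(0.0, 1.0) for _ in range(len(M))]
    for _ in range(iters):
        sq_norm = sum(v * v for v in x)
        Mx = matvec(M, x)
        x = [(1.0 - eta) * xi + eta * mxi / sq_norm for xi, mxi in zip(x, Mx)]
    sq_norm = sum(v * v for v in x)
    norm = math.sqrt(sq_norm)
    return sq_norm, [v / norm for v in x]


# Symmetric PSD example with eigenvalues 3 and 1; the top eigenvector
# is (1, 1) / sqrt(2) up to sign.
M = [[2.0, 1.0], [1.0, 2.0]]
sigma1, u1 = gd_top_singular(M)
print(sigma1, u1)
```

The sign of the returned vector depends on the random initial point, so only its direction up to sign is meaningful. Pure Python is used here to keep the sketch dependency-free; a practical implementation would use an optimized matrix-vector product.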

Figures (1)

  • Figure 1: Here, we illustrate the runtime and convergence performance of GDSVD in both C and Python as we vary the gap $\sigma_1 - \sigma_2$ in the rank-$2$ setting. The implementations of GDSVD with different values of $\eta \in (0,1)$ were compared with the power method. The curves were averaged over values of $n \in \lbrace 50, 100, 200, \dots, 1000 \rbrace$, and the shaded areas correspond to the standard deviations. The dotted plots correspond to the lowest and highest performance over the different values of $n$.

Theorems & Definitions (56)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proposition 1
  • Lemma 1: Region of attraction
  • Lemma 2: Avoiding saddle points
  • Lemma 3
  • Lemma 4
  • Theorem 4
  • Proof of Proposition (prop:grad convergence rank 1)
  • ...and 46 more