Global Optimality of Local Search for Low Rank Matrix Recovery

Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

TL;DR

We address recovering a low-rank PSD matrix $\bm{X^*}$ from linear measurements via the non-convex factorization $\bm{X}=\bm{U}\bm{U}^T$ and objective $f(\bm{U})=\|\mathcal{A}(\bm{U}\bm{U}^T)-\bm{y}\|^2$. When $\mathcal{A}$ satisfies $(2r,\delta_{2r})$-RIP with $\delta_{2r}<1/5$, the landscape has no spurious local minima in the noiseless case. With noisy measurements, all local minima are close to the global optimum (Theorem 3.1 states this under the stronger condition $\delta_{2r}<1/10$). Saddle points have negative curvature, which lets SGD from random initialization converge to a global optimum in polynomial time. The results extend to approximately low-rank matrices with explicit error bounds and give near-optimal sample complexity for Gaussian measurements; counterexamples show that the guarantees can fail when RIP does not hold. The work provides a theoretical justification for the non-convex matrix factorization methods used in practice, and its landscape analysis may offer insight into broader rank-constrained problems and deep networks.

Abstract

We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.
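The factorized objective described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names, the $1/m$ scaling of the objective, the step size, and the problem sizes are all assumptions, and plain full-batch gradient descent stands in for the stochastic gradient descent analyzed in the paper.

```python
import numpy as np

def make_problem(n, r, m, rng):
    """Rank-r PSD target on a random subspace plus m Gaussian measurements."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal basis, n x r
    X_star = Q @ Q.T
    A = rng.standard_normal((m, n, n))                # measurement matrices A_k
    y = np.einsum("kij,ij->k", A, X_star)             # noiseless y_k = <A_k, X*>
    return A, y, X_star

def objective(U, A, y):
    """f(U) = (1/m) * sum_k (<A_k, U U^T> - y_k)^2 (the 1/m scaling is assumed)."""
    residual = np.einsum("kij,ij->k", A, U @ U.T) - y
    return float(residual @ residual) / len(y)

def gradient(U, A, y):
    """Gradient of f: (2/m) * sum_k residual_k * (A_k + A_k^T) U."""
    residual = np.einsum("kij,ij->k", A, U @ U.T) - y
    S = np.einsum("k,kij->ij", residual, A)           # sum_k residual_k * A_k
    return 2.0 * (S + S.T) @ U / len(y)

def factored_gd(A, y, r, rng, step=0.01, iters=2000):
    """Full-batch gradient descent on U from a random Gaussian starting point."""
    n = A.shape[1]
    U = rng.standard_normal((n, r)) / np.sqrt(n)
    for _ in range(iters):
        U = U - step * gradient(U, A, y)
    return U

def relative_error(U, X_star):
    """Frobenius-norm error of U U^T relative to X*."""
    return np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, y, X_star = make_problem(n=10, r=2, m=400, rng=rng)
    U_hat = factored_gd(A, y, r=2, rng=rng)
    print("relative error:", relative_error(U_hat, X_star))
```

The sketch does not verify RIP for the sampled operator; it only shows the parametrization $\bm{X}=\bm{U}\bm{U}^T$ and a local search over $\bm{U}$ that never forms an SVD or a projection onto the rank constraint.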

Paper Structure

This paper contains 19 sections, 18 theorems, 71 equations, 3 figures.

Key Result

Theorem 3.1

Consider the optimization problem eq:prob2 where $\bm{y}=\mathcal{A}(\bm{X^*})+\bm{w}$, $\bm{w}$ is i.i.d. $\mathcal{N}(0,\sigma_w^2)$, $\mathcal{A}$ satisfies $(2r,\delta_{2r})$-RIP with $\delta_{2r}<\frac{1}{10}$, and $\operatorname{rank}(\bm{X^*})\leq r$. Then, with probability $\geq 1 -\frac{10}{

Figures (3)

  • Figure 1: Success probability of gradient descent with (left) random and (center) SVD initialization (suggested in jain2013low) for problem \ref{eq:prob2}, for an increasing number of samples $m$ and various values of the rank $r$. The rightmost plot shows, for each $r$, the smallest $m$ at which the success probability reaches $0.5$. A run counts as a success if $\|\bm{U} \bm{U}^\top -\bm{X^*}\|_F/\|\bm{X^*}\|_F \leq 10^{-2}$. White cells denote successful recovery and black cells denote failure. We set $n=100$. Each measurement $y_i$ is the inner product of an entrywise i.i.d. Gaussian matrix with a rank-$r$ PSD matrix on a random subspace. We notice no significant difference between the two initialization methods, suggesting the absence of spurious local minima. Both methods have a phase transition around $m =2\cdot n\cdot r$.
  • Figure 2: Success probability for an increasing number of samples $m$ and various values of the rank $r$. The top plots are for gradient descent, with random initialization on the left and SVD initialization on the right. The bottom plots show the same two initializations for noisy gradient descent. We notice no significant difference between these settings. They all have a phase transition around $m =2\cdot n\cdot r$.
  • Figure 3: The error $\|\widehat{\bm{U}} \widehat{\bm{U}}^\top -\bm{X^*}\|_F/\|\bm{X^*}\|_F$ for an increasing number of samples $m$. The left plot is for gradient descent with random initialization and the center plot is for SVD initialization. Again we notice no difference in error between these two settings. The rightmost plot shows the phase transition of low-rank recovery for all the different settings when $\bm{X^*}$ has rank 10.
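The experiment described in the captions above can be imitated at a much smaller scale. The following is a rough sketch under stated assumptions, not a reproduction of the figures: all function names are hypothetical, `svd_init` is one common reading of the SVD initialization (top-$r$ eigenpairs of the symmetrized back-projection $\frac{1}{m}\sum_i y_i \bm{A}_i$), and the sizes, step size, and iteration count are arbitrary choices. Only the success criterion, relative error at most $10^{-2}$, is taken from the Figure 1 caption.

```python
import numpy as np

def sample_instance(n, r, m, rng):
    """Gaussian measurements of a rank-r PSD matrix on a random subspace."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    X_star = Q @ Q.T
    A = rng.standard_normal((m, n * n))      # each row is a flattened A_i
    y = A @ X_star.ravel()                   # y_i = <A_i, X*>
    return A, y, X_star

def random_init(A, y, r, rng):
    """Random Gaussian starting point (ignores the measurements)."""
    n = int(round(np.sqrt(A.shape[1])))
    return rng.standard_normal((n, r)) / np.sqrt(n)

def svd_init(A, y, r, rng=None):
    """Assumed SVD initialization: top-r eigenpairs of (1/m) sum_i y_i A_i."""
    n = int(round(np.sqrt(A.shape[1])))
    M = (A.T @ y).reshape(n, n) / len(y)
    M = 0.5 * (M + M.T)                      # symmetrize before eigh
    vals, vecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    top = np.clip(vals[-r:], 0.0, None)      # guard against negative values
    return vecs[:, -r:] * np.sqrt(top)

def run_gd(U, A, y, step=0.01, iters=2000):
    """Gradient descent on (1/m) * ||A(U U^T) - y||^2."""
    n = U.shape[0]
    for _ in range(iters):
        residual = A @ (U @ U.T).ravel() - y
        S = (A.T @ residual).reshape(n, n)   # sum_i residual_i * A_i
        U = U - step * 2.0 * (S + S.T) @ U / len(y)
    return U

def success_rate(init, n, r, m, trials, rng, tol=1e-2):
    """Fraction of trials with relative Frobenius error at most tol."""
    wins = 0
    for _ in range(trials):
        A, y, X_star = sample_instance(n, r, m, rng)
        U = run_gd(init(A, y, r, rng), A, y)
        err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
        wins += int(err <= tol)
    return wins / trials

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for name, init in [("random", random_init), ("svd", svd_init)]:
        rate = success_rate(init, n=10, r=2, m=400, trials=3, rng=rng)
        print(name, "success rate:", rate)
```

Sweeping `m` and `r` over a grid and plotting `success_rate` for each initializer would give a small-scale analogue of the phase-transition plots, though the location of the transition at these sizes need not match the $n=100$ experiments.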

Theorems & Definitions (33)

  • Definition 2.1: Restricted Isometry Property
  • Theorem 3.1
  • Theorem 3.2: Strict saddle
  • Theorem 3.3: Convergence from random initialization
  • Theorem 3.4
  • Theorem 4.1
  • Lemma 4.1
  • Lemma 4.2
  • Proof: Proof of Lemma \ref{lem:first}
  • Lemma 4.3
  • ...and 23 more
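
Definition 2.1 in the list above refers to the Restricted Isometry Property. In its usual form (the $1/m$ normalization here is an assumption about the paper's convention), $\mathcal{A}$ satisfies $(r,\delta_r)$-RIP if $(1-\delta_r)\|\bm{X}\|_F^2 \leq \frac{1}{m}\|\mathcal{A}(\bm{X})\|_2^2 \leq (1+\delta_r)\|\bm{X}\|_F^2$ for every matrix $\bm{X}$ of rank at most $r$. The sketch below probes this ratio on randomly sampled low-rank matrices. The function names are hypothetical, and random sampling only yields a lower bound on $\delta_r$: it can expose a violation but cannot certify that RIP holds.

```python
import numpy as np

def rip_ratio_range(A, n, rank, samples, rng):
    """Min and max of (1/m)||A(X)||^2 / ||X||_F^2 over random rank-`rank` X.

    A has shape (m, n*n); each row is a flattened measurement matrix.
    """
    m = A.shape[0]
    ratios = []
    for _ in range(samples):
        # Product of two Gaussian factors has rank `rank` with probability one.
        X = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
        ratios.append(np.sum((A @ X.ravel()) ** 2) / (m * np.sum(X ** 2)))
    return min(ratios), max(ratios)

def empirical_delta(A, n, rank, samples, rng):
    """Largest observed deviation of the ratio from 1 (a lower bound on delta)."""
    lo, hi = rip_ratio_range(A, n, rank, samples, rng)
    return max(1.0 - lo, hi - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, r, m = 10, 2, 4000
    A = rng.standard_normal((m, n * n))      # entrywise i.i.d. Gaussian operator
    print("observed deviation at rank 2r:", empirical_delta(A, n, 2 * r, 200, rng))
```

The rank passed in the example is $2r$ because the theorems are stated in terms of $\delta_{2r}$.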