Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification

Gavin Zhang, Salar Fattahi, Richard Y. Zhang

TL;DR

This work tackles minimizing f(X)=φ(XX^{T}) over X∈R^{n×r} in the overparameterized Burer--Monteiro setting by introducing preconditioned gradient descent (PrecGD), with updates X_{+}=X-α∇f(X)(X^{T}X+ηI)^{-1}. In the overparameterized regime, PrecGD converges linearly at a rate that does not depend on how close X is to rank deficiency, as measured by λ_{\text{min}}(X^{T}X), provided the regularization parameter η is chosen relative to the current error, which keeps the preconditioner stable and well conditioned. The authors also prove global convergence for a perturbed variant (PMGD) and provide a posteriori certificates of global optimality via rank deficiency, along with numerical demonstrations on low-rank matrix recovery, 1-bit matrix sensing, and phase retrieval. Overall, the paper offers a practical, theory-backed method that retains fast convergence and certifies optimality in overparameterized nonconvex factorizations, including problems where ill-conditioning would otherwise hinder progress.
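The update above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation, and it rests on several assumptions that are not in the summary: the cost is taken to be φ(M) = ½‖M − M*‖_F², η is set to the current residual ‖XX^{T} − M*‖_F as one way of tying it to the error, and the helper names (`make_problem`, `precgd_step`, `gd_step`, `run`), problem sizes, step size, and stopping tolerance are all hypothetical.

```python
import numpy as np


def make_problem(n=6, r_star=2, r=4, seed=0):
    """Build a small overparameterized instance (r > r_star). Sizes are illustrative."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, r_star)))  # orthonormal columns
    X_star = Q * np.linspace(1.0, 0.5, r_star)             # scale columns: mildly ill-conditioned
    M_star = X_star @ X_star.T
    # Assumption: start close to the ground truth, padded with near-zero extra columns.
    X0 = np.hstack([X_star, np.zeros((n, r - r_star))])
    X0 = X0 + 0.01 * rng.standard_normal((n, r))
    return X0, M_star


def gd_step(X, M_star, alpha):
    """Plain gradient step; for this phi, grad f(X) = 2 (X X^T - M_star) X."""
    return X - alpha * 2.0 * (X @ X.T - M_star) @ X


def precgd_step(X, M_star, alpha):
    """One step X_+ = X - alpha * grad f(X) (X^T X + eta I)^{-1}."""
    E = X @ X.T - M_star
    eta = np.linalg.norm(E, "fro")            # assumption: eta tracks the current residual
    P = X.T @ X + eta * np.eye(X.shape[1])    # r x r symmetric preconditioner
    G = 2.0 * E @ X
    # G P^{-1} equals (P^{-1} G^T)^T because P is symmetric; avoids forming an inverse.
    return X - alpha * np.linalg.solve(P, G.T).T


def run(step, X0, M_star, alpha, iters=300, tol=1e-10):
    """Iterate `step` and return the final residual ||X X^T - M_star||_F."""
    X = X0.copy()
    for _ in range(iters):
        if np.linalg.norm(X @ X.T - M_star, "fro") < tol:
            break  # stop before eta shrinks to the rounding-error level
        X = step(X, M_star, alpha)
    return np.linalg.norm(X @ X.T - M_star, "fro")


if __name__ == "__main__":
    X0, M_star = make_problem()
    print("GD residual:    ", run(gd_step, X0, M_star, alpha=0.1))
    print("PrecGD residual:", run(precgd_step, X0, M_star, alpha=0.1))
```

The preconditioner is only r×r, so each step adds one small linear solve on top of the gradient. Solving against `P` instead of forming its inverse is a numerical-hygiene choice in this sketch, not something the summary prescribes.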

Abstract

We consider using gradient descent to minimize the nonconvex function $f(X)=φ(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $φ$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $φ$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
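As a worked step, the gradient of $f$ follows from the chain rule applied to the definition $f(X)=φ(XX^{T})$. The only added assumption is that $\nabla φ(M)$ is symmetric in the final simplification, which holds for example when $φ(M)=\tfrac12\|M-M^{\star}\|_F^2$ with symmetric $M^{\star}$; the direction $D$ is an arbitrary $n\times r$ perturbation introduced for illustration.

$$
\begin{aligned}
f(X+D) &= φ\big(XX^{T} + XD^{T} + DX^{T} + DD^{T}\big) \\
&= f(X) + \big\langle \nabla φ(XX^{T}),\, XD^{T} + DX^{T} \big\rangle + O(\|D\|_F^{2}) \\
&= f(X) + \big\langle \big(\nabla φ(XX^{T}) + \nabla φ(XX^{T})^{T}\big) X,\, D \big\rangle + O(\|D\|_F^{2}),
\end{aligned}
$$

so that

$$
\nabla f(X) = \big(\nabla φ(XX^{T}) + \nabla φ(XX^{T})^{T}\big) X = 2\,\nabla φ(XX^{T})\,X \quad \text{when } \nabla φ(XX^{T}) \text{ is symmetric.}
$$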

Paper Structure (34 sections, 31 theorems, 174 equations, 7 figures)

Key Result

Theorem 4

Let $\phi$ be $L_{1}$-gradient Lipschitz and $(\mu,2r)$-restricted strongly convex, and let $M^{\star}=\arg\min\phi$ satisfy $M^{\star}=X^{\star}X^{\star T}$ and $r^{\star}=\operatorname{rank}(M^{\star})\le r$. Define $f(X)\overset{\text{def}}{=}\phi(XX^{T})$; if $X$ is sufficiently close to global and if $\eta$ is bounded from above and below by the distance to the global optimizer then PrecGD

Figures (7)

  • Figure 1: PrecGD converges linearly in the overparameterized regime. Comparison of PrecGD against regular gradient descent (GD) and the ScaledGD algorithm of [tong2020accelerating] for an instance of (\ref{eq:ncvx}) taken from [NEURIPS2018_f8da71e5, zhang2019sharp]. The same initial points and the same step-size $\alpha=2\times10^{-2}$ were used for all three algorithms. (Left, $r=r^{*}$) Set $n=4$ and $r^{*}=r=2$. All three methods converge at a linear rate, though GD converges more slowly due to ill-conditioning in the ground truth. (Right, $r>r^{*}$) With $n=4$, $r=4$ and $r^{*}=2$, overparameterization causes gradient descent to slow down to a sublinear rate. ScaledGD also behaves erratically. Only PrecGD converges linearly to the global minimum.
  • Figure 2: Low-rank matrix recovery with $\ell_{2}$ loss. First row: Well-conditioned ($\kappa=1$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. Both algorithms converge linearly to machine precision. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD converges sublinearly, while PrecGD maintains the same convergence rate. Second row: Ill-conditioned ($\kappa=5$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. GD stagnates due to ill-conditioning while PrecGD converges linearly. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD continues to stagnate, while PrecGD maintains the same linear convergence rate.
  • Figure 3: PrecGD, ScaledGD and GD with random initialization. Comparison of PrecGD against regular gradient descent (GD) and the ScaledGD algorithm. All three methods use the same global Gaussian random initialization and the same step-size $\alpha=2\times10^{-3}$. With $n=4$, $r=4$ and $r^{*}=2$, overparameterization causes gradient descent to slow down to a sublinear rate. ScaledGD behaves erratically and diverges. Only PrecGD converges linearly to the global minimum.
  • Figure 4: 1-bit matrix sensing. First row: Well-conditioned ($\kappa=1$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. Both algorithms converge linearly to machine precision. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD converges sublinearly, while PrecGD maintains the same convergence rate. Second row: Ill-conditioned ($\kappa=10$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. GD stagnates due to ill-conditioning while PrecGD converges linearly. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD continues to stagnate, while PrecGD maintains the same linear convergence rate.
  • Figure 5: Phase retrieval. First row: Well-conditioned ($\kappa=1$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. Both algorithms converge linearly to machine precision. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD converges sublinearly, while PrecGD maintains the same convergence rate. Second row: Ill-conditioned ($\kappa=5$), rank-2 ground truth of size $100\times100$. The left panel shows the performance of GD and PrecGD for $r=r^* = 2$. GD stagnates due to ill-conditioning while PrecGD converges linearly. The right panel shows the performance of GD and PrecGD for $r = 4$. The overparameterized GD continues to stagnate, while PrecGD maintains the same linear convergence rate.
  • ...and 2 more figures

Theorems & Definitions (38)

  • Definition 1: Gradient Lipschitz
  • Definition 2: Strong convexity
  • Remark 3
  • Theorem 4: Linear convergence
  • Corollary 5: Optimal parameter
  • Definition 6: Hessian Lipschitz
  • Definition 7: Strict saddle property
  • Theorem 8: Approximate second-order optimality
  • Corollary 9: Global convergence
  • Proposition 10: Certificate of global optimality
  • ...and 28 more