Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification
Gavin Zhang, Salar Fattahi, Richard Y. Zhang
TL;DR
This work tackles minimizing f(X)=φ(XX^{T}) over X∈R^{n×r} in the overparameterized Burer--Monteiro setting by introducing preconditioned gradient descent (PrecGD), with updates X_{+}=X-α∇f(X)(X^{T}X+ηI)^{-1}. In the overparameterized regime, PrecGD converges linearly at a rate that does not depend on how close X is to rank deficient, as measured by λ_{\min}(X^{T}X), provided the damping parameter η is chosen relative to the current error, which keeps the iteration stable and well conditioned. The authors also prove global convergence under a perturbed framework (PMGD) and provide a posteriori certificates of global optimality via rank deficiency, along with numerical demonstrations on low-rank matrix recovery, 1-bit matrix sensing, and phase retrieval. Overall, the paper offers a practical, theory-backed method that retains fast convergence and certifies optimality in overparameterized nonconvex factorizations, including large-scale problems where ill-conditioning would otherwise hinder progress.
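The update rule above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the simplest cost φ(M)=½‖M-M^{\star}‖_F^2 with a known rank-2 target M^{\star}, so that ∇f(X)=2(XX^{T}-M^{\star})X, and it sets η equal to the current error ‖XX^{T}-M^{\star}‖_F, which is one reading of "chosen relative to the current error" and is only computable here because M^{\star} is known. The function name `run_descent`, the problem sizes, the step size, and the initialization near the target are all illustrative assumptions.

```python
import numpy as np


def run_descent(precondition, n=20, r_star=2, r=4, alpha=0.1,
                iters=500, tol=1e-12, seed=0):
    """Minimize f(X) = 0.5 * ||X X^T - M_star||_F^2 with search rank r > r_star.

    Hypothetical helper for illustration; returns the final error
    ||X X^T - M_star||_F.
    """
    rng = np.random.default_rng(seed)

    # Rank-r_star target with eigenvalues between 1.0 and 0.5 (assumed setup).
    U, _ = np.linalg.qr(rng.standard_normal((n, r_star)))
    eigs = np.linspace(1.0, 0.5, r_star)
    M_star = (U * eigs) @ U.T

    # Overparameterized start: true factor padded with zero columns, plus noise.
    Z = U * np.sqrt(eigs)
    X = np.hstack([Z, np.zeros((n, r - r_star))])
    X = X + 0.05 * rng.standard_normal((n, r))

    for _ in range(iters):
        E = X @ X.T - M_star
        err = np.linalg.norm(E)          # Frobenius norm of the residual
        if err < tol:
            break
        G = 2.0 * E @ X                  # gradient of f at X
        if precondition:
            # Right-multiply by (X^T X + eta I)^{-1}, with eta = current error
            # (an assumption; in practice the error must be estimated).
            P = X.T @ X + err * np.eye(r)
            G = np.linalg.solve(P, G.T).T  # P is symmetric, so this is G P^{-1}
        X = X - alpha * G

    return np.linalg.norm(X @ X.T - M_star)


err_gd = run_descent(precondition=False)
err_precgd = run_descent(precondition=True)
print(f"plain GD error: {err_gd:.3e}")
print(f"PrecGD error:   {err_precgd:.3e}")
```

Running both variants from the same starting point gives a rough feel for the gap between the sublinear behavior of plain gradient descent and the linear behavior of the preconditioned iteration when r>r^{\star}. The preconditioner is an r×r solve per step, which is why it is cheap relative to the n×r gradient computation.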
Abstract
We consider using gradient descent to minimize the nonconvex function $f(X)=φ(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $φ$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $φ$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
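The claim that rank deficiency certifies global optimality follows from a standard argument, sketched here under the assumption that $\nabla φ$ is symmetric (which can be arranged by symmetrizing $φ$). The symbols $S$, $u$, $w$ and $V$ are introduced only for this sketch, and the steps are not necessarily those of the authors' proof.

```latex
\begin{align*}
&\text{Let } S = \nabla\phi(XX^{T}). \text{ First-order stationarity gives }
  \nabla f(X) = 2\,S X = 0. \\[4pt]
&\text{Second-order stationarity gives, for every } V \in \mathbb{R}^{n\times r}, \\
&\qquad 2\,\langle S,\, VV^{T}\rangle
  + \nabla^{2}\phi(XX^{T})\big[XV^{T}+VX^{T},\; XV^{T}+VX^{T}\big] \;\ge\; 0. \\[4pt]
&\text{If } X \text{ is rank deficient, pick a unit vector } u \in \mathbb{R}^{r}
  \text{ with } Xu = 0 \text{ and set } V = w u^{T} \text{ for any } w \in \mathbb{R}^{n}. \\
&\text{Then } XV^{T} = (Xu)\,w^{T} = 0 \text{ and } VV^{T} = ww^{T}, \text{ so } \\
&\qquad 2\, w^{T} S\, w \;\ge\; 0 \quad \text{for all } w
  \quad\Longrightarrow\quad S \succeq 0. \\[4pt]
&\text{Hence } M = XX^{T} \text{ satisfies } M \succeq 0,\;\; S \succeq 0,\;\; SM = 0, \\
&\text{which are the KKT conditions of the convex problem }
  \min_{M \succeq 0} \phi(M), \text{ so } M \text{ is a global minimizer.}
\end{align*}
```

The choice $V = w u^{T}$ is only available when $X$ has a nontrivial null space, which is why the certificate needs the search rank $r$ to exceed the rank of the point being certified.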
