Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization

Richard Y. Zhang

TL;DR

This work analyzes the nonconvex Burer–Monteiro factorization for semidefinite-program-like objectives by studying $f(X)=φ(XX^{T})$, where $φ$ is $L$-smooth and $μ$-strongly convex. It proves that a constant-factor overparameterization, specifically $r>\frac{1}{4}(L/μ-1)^{2}r^{\star}$, eliminates spurious local minima, enabling global convergence from arbitrary initializations and improving on the previous $r\ge n$ threshold. A corollary shows that in the exact-parameterization regime with favorable conditioning ($L/μ<3$), no spurious local minima arise for $r^{\star}\le r$, highlighting a sharp dependence on conditioning. The paper develops a two-stage SDP bounding framework and a valid inequality relating invariants α,β to characterize counterexamples, providing rigorous insight into how modest overparameterization reshapes the optimization landscape and informs algorithmic design for large-scale SDP-like problems.

Abstract

We consider minimizing a twice-differentiable, $L$-smooth, and $μ$-strongly convex objective $φ$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=φ(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a \emph{constant factor} with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/μ-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which we show is sharp in the absence of smoothness and strong convexity, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $φ$ is almost perfectly conditioned, with a condition number of $L/μ<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.
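
To make the rank threshold concrete, the following is a minimal sketch that evaluates the condition $r>\frac{1}{4}(L/μ-1)^{2}r^{\star}$ from the abstract and compares variable counts. The function names `min_search_rank` and `variable_counts` are hypothetical, and capping the result at $n$ is an assumption based on the abstract's remark that $r\ge n$ was already a known sufficient threshold.

```python
import math


def min_search_rank(L, mu, r_star, n):
    """Smallest integer r with r >= r_star and r > (1/4) * (L/mu - 1)**2 * r_star.

    The result is capped at n (assumption: the earlier r >= n threshold
    already suffices, so there is no reason to exceed n).
    """
    threshold = 0.25 * (L / mu - 1.0) ** 2 * r_star
    # The inequality is strict, so take the first integer above the threshold.
    r = max(math.floor(threshold) + 1, r_star)
    return min(r, n)


def variable_counts(n, r):
    """Return (factored, full): entries of the n-by-r factor X versus the
    distinct entries of a symmetric n-by-n matrix M."""
    return n * r, n * (n + 1) // 2


# Illustrative numbers (not taken from the paper): condition number L/mu = 5.
r = min_search_rank(L=5.0, mu=1.0, r_star=2, n=1000)
print(r, variable_counts(1000, r))
```

When $L/μ<3$ the factor $\frac{1}{4}(L/μ-1)^{2}$ is below 1, so the function returns $r^{\star}$ itself, which matches the exact-parameterization statement in the abstract.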

Paper Structure

This paper contains 11 sections, 15 theorems, 59 equations, and 1 figure.

Key Result

Theorem 1 (Overparameterization)

Let $\phi:\mathbb{S}^{n}\to\mathbb{R}$ be twice-differentiable, $L$-smooth, and $\mu$-strongly convex, and let the minimizer $M^{\star}=\arg\min_{M\succeq0}\phi(M)$ have true rank $r^{\star}=\mathrm{rank}(M^{\star})$.

Figures (1)

  • Figure 1: Overparameterization eliminates spurious local minima. Stochastic gradient descent (SGD) with Nesterov momentum [sutskever2013importance] is applied to an objective $f(X)\overset{\mathrm{def}}{=}\phi(XX^{T})$ that has a spurious second-order point $X_{\mathrm{spur}}$ for $r=3$. (Left) With search rank $r=3$, SGD remains stuck at $X\approx X_{\mathrm{spur}}$, resulting in 55 failures out of 100 trials. (Right) Overparameterizing to $r=4$ eliminates $X_{\mathrm{spur}}$ as a spurious second-order point, and SGD now succeeds in all 100 trials. (Setup: $\phi(M)=\sum_{i,j=1}^{n}\phi_{i,j}(M)$, where $\phi_{i,j}(M)=\frac{1}{2}|\langle A^{(i,j)},M-M^{\star}\rangle|^{2}$ as in \ref{exa:overparam}, with $n=5$, $r=3$, and $r^{\star}=2$. Initialize $V=0$, sample $X$ uniformly from the ball $\|X-X_{\mathrm{spur}}\|_{F}\le0.1$, and then iterate $V_{\mathrm{new}}=\beta V-\alpha\nabla f_{i,j}(X)$ and $X_{\mathrm{new}}=X+\beta V_{\mathrm{new}}-\alpha\nabla f_{i,j}(X)$, with learning rate $\alpha=1\times10^{-1}$ and momentum $\beta=0.9$. The sample indices $i,j$ are reshuffled once per epoch, where 1 epoch = 25 iterations.)
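
The update rule in the caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the matrices $A^{(i,j)}$ from the referenced example are not given here, so the sketch substitutes the elementary matrices $e_{i}e_{j}^{T}$, which makes $\phi(M)=\frac{1}{2}\|M-M^{\star}\|_{F}^{2}$ and therefore does not reproduce the spurious point $X_{\mathrm{spur}}$. The function names `grad_f_ij`, `nesterov_step`, and `run_epochs` are hypothetical.

```python
import random


def grad_f_ij(X, M_star, i, j):
    """Gradient in X of f_ij(X) = 0.5 * ((X X^T - M_star)[i][j])**2.

    Assumes A^(i,j) = e_i e_j^T (illustrative choice, not the paper's).
    In general the gradient is <A, X X^T - M_star> * (A + A^T) X.
    """
    n, r = len(X), len(X[0])
    residual = sum(X[i][k] * X[j][k] for k in range(r)) - M_star[i][j]
    G = [[0.0] * r for _ in range(n)]
    for k in range(r):
        # (A + A^T) X only touches rows i and j; if i == j both terms add to row i.
        G[i][k] += residual * X[j][k]
        G[j][k] += residual * X[i][k]
    return G


def nesterov_step(X, V, G, alpha=0.1, beta=0.9):
    """One step: V_new = beta*V - alpha*G, then X_new = X + beta*V_new - alpha*G."""
    n, r = len(X), len(X[0])
    V_new = [[beta * V[a][k] - alpha * G[a][k] for k in range(r)]
             for a in range(n)]
    X_new = [[X[a][k] + beta * V_new[a][k] - alpha * G[a][k] for k in range(r)]
             for a in range(n)]
    return X_new, V_new


def run_epochs(X, M_star, epochs, alpha=0.1, beta=0.9, seed=0):
    """Run SGD with the momentum update above, reshuffling (i, j) each epoch."""
    rng = random.Random(seed)
    n, r = len(X), len(X[0])
    V = [[0.0] * r for _ in range(n)]  # V = 0 at initialization
    pairs = [(i, j) for i in range(n) for j in range(n)]  # n*n terms per epoch
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            G = grad_f_ij(X, M_star, i, j)
            X, V = nesterov_step(X, V, G, alpha, beta)
    return X
```

With $n=5$ the list `pairs` has 25 entries, so one pass over it corresponds to the 25 iterations per epoch mentioned in the caption. Reproducing the 55-out-of-100 failure rate would additionally require the specific $A^{(i,j)}$, $M^{\star}$, and $X_{\mathrm{spur}}$ from the paper's example.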

Theorems & Definitions (28)

  • Theorem 1: Overparameterization
  • Corollary: Exact parameterization
  • Proposition: Strict saddle property
  • Lemma
  • Proof
  • Corollary: Restricted isometry property
  • Proof
  • Definition
  • Lemma: SDP formulation
  • Proof
  • ...and 18 more