Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization

Gavin Zhang, Salar Fattahi, Richard Y. Zhang

TL;DR

This work addresses the slow convergence of gradient methods in over-parameterized nonconvex matrix factorization, especially in matrix sensing. The authors introduce PrecGD, a preconditioned gradient descent variant that uses a damped metric $P=(X^{\top}X+\eta I_r)\otimes I_n$ and a carefully chosen damping parameter $\eta$ (estimable from the current iterate) to recover linear convergence even when $r>r^\star$ and the ground truth is ill-conditioned. In the noiseless setting, choosing $\eta_k=\sqrt{f(X_k)}$ yields gradient dominance in the $P$-norm and provable linear convergence; in the noisy setting, a variance-based damping achieves linear convergence to the minimax-optimal error floor. Experiments show that PrecGD restores linear convergence across variants of nonconvex matrix factorization, including nonsmooth losses, while keeping the per-iteration cost comparable to that of standard gradient descent. The results suggest PrecGD as a practical, robust tool for large-scale low-rank recovery tasks with unknown true rank.
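
The claim about cheap per-iteration cost rests on the Kronecker structure of the metric: applying $P^{-1}$ to a vectorized $n\times r$ gradient is the same as right-multiplying the gradient by the inverse of the $r\times r$ matrix $X^{\top}X+\eta I_r$. The following is a minimal NumPy sketch of that identity, not the authors' code; the function names are hypothetical and column-stacking vectorization is assumed.

```python
import numpy as np

def apply_inverse_metric(V, X, eta):
    """Compute V (X^T X + eta I_r)^{-1} using only an r x r linear solve."""
    r = X.shape[1]
    G = X.T @ X + eta * np.eye(r)        # damped Gram matrix, symmetric
    return np.linalg.solve(G, V.T).T     # (G^{-1} V^T)^T = V G^{-1}

def apply_inverse_metric_dense(V, X, eta):
    """Reference version that forms P = (X^T X + eta I_r) kron I_n explicitly."""
    n, r = X.shape
    G = X.T @ X + eta * np.eye(r)
    P = np.kron(G, np.eye(n))            # (n r) x (n r); only viable for tiny n, r
    v = V.reshape(-1, order="F")         # column-stacking vec(V)
    return np.linalg.solve(P, v).reshape(n, r, order="F")
```

The first function costs one $r\times r$ solve plus an $n\times r$ by $r\times r$ product, whereas the dense reference scales with $(nr)^3$ and is included only to check the identity on small examples.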

Abstract

In practical instances of nonconvex matrix factorization, the rank of the true solution $r^{\star}$ is often unknown, so the rank $r$ of the model can be overspecified as $r>r^{\star}$. This over-parameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the over-parameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by $\ell_{2}$ regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise, and converges linearly to an information-theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the over-parameterized regime.
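
To make the iteration concrete, here is a small NumPy sketch of preconditioned gradient descent on a noiseless, over-parameterized matrix sensing instance, with plain gradient descent as a switchable baseline. Several details are assumptions made for illustration rather than taken from the paper: the loss normalization $f(X)=\|\mathcal{A}(XX^{\top})-b\|^{2}$ with symmetrized Gaussian measurement matrices scaled by $1/\sqrt{m}$, the problem sizes, the step size, the stopping tolerance, the initialization near the ground truth, and all function names. The damping rule $\eta_k=\sqrt{f(X_k)}$ and the damped metric follow the description above.

```python
import numpy as np

def make_problem(n=6, r_star=2, m=300, seed=0):
    """Random noiseless instance with b_i = <A_i, Z Z^T> / sqrt(m) (assumed scaling)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, r_star)))
    Z = Q * np.linspace(1.0, 0.3, r_star)     # mildly ill-conditioned ground truth
    G = rng.standard_normal((m, n, n))
    A = (G + G.transpose(0, 2, 1)) / 2        # symmetrized measurement matrices
    b = np.einsum("kij,ij->k", A, Z @ Z.T) / np.sqrt(m)
    return A, b, Z

def loss_and_grad(X, A, b):
    """f(X) = ||A(X X^T) - b||^2 and its gradient 4 * A^*(residual) @ X."""
    m = A.shape[0]
    resid = np.einsum("kij,ij->k", A, X @ X.T) / np.sqrt(m) - b
    adjoint = np.einsum("k,kij->ij", resid, A) / np.sqrt(m)   # symmetric n x n
    return float(resid @ resid), 4.0 * adjoint @ X

def run_descent(X0, A, b, alpha=0.02, iters=1000, precondition=True, tol=1e-18):
    """Gradient descent, optionally right-preconditioned by (X^T X + eta I)^{-1}."""
    X = X0.copy()
    r = X.shape[1]
    history = []
    for _ in range(iters):
        f, grad = loss_and_grad(X, A, b)
        history.append(f)
        if f < tol:                            # stop before eta underflows
            break
        if precondition:
            eta = np.sqrt(f)                   # damping estimated from the iterate
            gram = X.T @ X + eta * np.eye(r)
            step = np.linalg.solve(gram, grad.T).T   # grad @ gram^{-1}
        else:
            step = grad                        # plain gradient descent baseline
        X = X - alpha * step
    return X, history

if __name__ == "__main__":
    A, b, Z = make_problem()
    rng = np.random.default_rng(1)
    # Over-parameterized start: r = 4 columns for a rank-2 ground truth.
    X0 = np.hstack([Z, np.zeros((6, 2))]) + 0.01 * rng.standard_normal((6, 4))
    _, prec = run_descent(X0, A, b, precondition=True)
    _, plain = run_descent(X0, A, b, precondition=False)
    print("final loss, preconditioned:", prec[-1])
    print("final loss, plain GD:      ", plain[-1])
```

Plotting the two loss histories on a logarithmic axis is the natural way to compare the rates; the sketch starts both methods from the same point and uses the same step size so that any difference comes from the preconditioner alone.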

Paper Structure

This paper contains 21 sections, 24 theorems, 176 equations, 2 figures.

Key Result

Lemma 2

Let $\|D\|_{P}=\|D(X^{T}X+\eta I_r)^{1/2}\|_{F}$. Then we have where

Figures (2)

  • Figure 1: PrecGD converges linearly in the over-parameterized regime. Convergence of regular gradient descent (GD), ScaledGD and PrecGD for noiseless matrix sensing (with data taken from NEURIPS2018_f8da71e5 and zhang2019sharp) from the same initial points and using the same learning rate $\alpha=2\times10^{-2}$. (Left, $r=r^{*}$) Set $n=4$ and $r^{*}=r=2$. All three methods converge at a linear rate, though GD converges at a slower rate due to ill-conditioning in the ground truth. (Right, $r>r^{*}$) With $n=4$, $r=4$ and $r^{*}=2$, over-parameterization causes gradient descent to slow down to a sublinear rate. ScaledGD also behaves sporadically. Only PrecGD converges linearly to the ground truth.
  • Figure 2: Nonconvex matrix factorization with the $\ell_{p}$ empirical loss. We compare $\ell_{p}$ matrix sensing with $n=10$ and $r^{\star}=2$ and $\mathcal{A}$ taken from zhang2019sharp. The ground truth is chosen to be ill-conditioned ($\kappa=10^{2}$). For ScaledGD and PrecGD, we use the Polyak step-size in tong2021low. For GD we use a decaying step-size. (Top, $r=r^{*}$) For all three values of $p$, GD stagnates due to the ill-conditioning of the ground truth, while ScaledGD and PrecGD converge linearly in all three cases. (Bottom, $r>r^{*}$) With $r=4$, the problem is over-parameterized. GD again converges slowly and ScaledGD is sporadic due to near-singularity caused by over-parameterization. Once again we see PrecGD converge at a linear rate.

Theorems & Definitions (40)

  • Definition 1: RIP
  • Lemma 2: Lipschitz-like inequality
  • Lemma 3: Bounded gradient
  • Theorem 4: Noiseless gradient dominance
  • Corollary 5: Linear convergence
  • Proposition 6: Spectral Initialization
  • Theorem 7: Noisy measurements with optimal $\eta$
  • Theorem 8: Noisy measurements with variance proxy
  • Lemma 9: Lipschitz-like inequality; Lemma 2 restated
  • proof
  • ...and 30 more