Understanding the Implicit Regularization of Gradient Descent in Over-parameterized Models

Jianhao Ma, Geyu Liang, Salar Fattahi

TL;DR

The paper tackles why gradient descent tends to produce low-dimensional solutions in over-parameterized models by introducing a unified framework around an implicit region M induced by reparameterization. It shows that GD can converge to an M-SOSP within M if initialization is near M, saddles are efficiently escaped via infinitesimal perturbations, and the trajectory remains close to M, formalized through a signal–residual decomposition and a deviation-rate measure. A central contribution is Infinitesimally Perturbed Gradient Descent (IPGD), which provably achieves ε-M-SOSP convergence with poly-logarithmic perturbation costs and improved residual control, and is shown to perform near-linearly in over-parameterized matrix sensing under mild assumptions. The authors provide detailed theoretical results for IPGD, including Lipschitz conditions, region-closure, strict-saddle and regularity properties, and a comprehensive application to over-parameterized matrix sensing, supplemented by numerical experiments that extend to low-rank recovery and sparse recovery. The work offers practical insights into implicit regularization and suggests broad applicability across nonconvex, over-parameterized learning problems with potential extensions to stochastic and nonsmooth settings.

Abstract

Implicit regularization refers to the tendency of local search algorithms to converge to low-dimensional solutions, even when such structures are not explicitly enforced. Despite its ubiquity, the mechanism underlying this behavior remains poorly understood, particularly in over-parameterized settings. We analyze gradient descent dynamics and identify three conditions under which it converges to second-order stationary points within an implicit low-dimensional region: (i) suitable initialization, (ii) efficient escape from saddle points, and (iii) sustained proximity to the region. We show that these can be achieved through infinitesimal perturbations and a small deviation rate. Building on this, we introduce Infinitesimally Perturbed Gradient Descent (IPGD), which satisfies these conditions under mild assumptions. We provide theoretical guarantees for IPGD in over-parameterized matrix sensing and empirical evidence of its broader applicability.
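Condition (ii), escaping saddle points with very small perturbations, can be illustrated on a toy problem. The sketch below is not the paper's IPGD, whose details are not given on this page. It is a generic perturbed gradient descent loop in the spirit of Jin et al. (2017), applied to a hypothetical two-dimensional function with a strict saddle at the origin. The function `f`, the step size, the perturbation radius `1e-8`, and the names `gd`, `grad_tol`, and `wait` are all assumptions chosen for illustration.

```python
import math
import random


def f(x, y):
    # Toy objective (an assumption, not from the paper):
    # strict saddle at (0, 0), minima at (0, 1) and (0, -1).
    return 0.5 * x * x + 0.25 * y ** 4 - 0.5 * y * y


def grad(x, y):
    # Gradient of f.
    return x, y ** 3 - y


def gd(x, y, eta=0.1, steps=1000, radius=0.0, grad_tol=1e-6, wait=400, seed=0):
    """Gradient descent; if radius > 0, add a tiny random kick whenever the
    gradient is small and no kick was applied in the last `wait` steps."""
    rng = random.Random(seed)
    last_kick = -wait
    for t in range(steps):
        gx, gy = grad(x, y)
        if radius > 0 and math.hypot(gx, gy) <= grad_tol and t - last_kick >= wait:
            # Perturbation drawn uniformly from a circle of the given radius.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            x += radius * math.cos(angle)
            y += radius * math.sin(angle)
            last_kick = t
            gx, gy = grad(x, y)
        x -= eta * gx
        y -= eta * gy
    return x, y


# Start on the x-axis, which is the stable manifold of the saddle.
plain = gd(1.0, 0.0)                    # no perturbation
perturbed = gd(1.0, 0.0, radius=1e-8)   # very small perturbation
print("plain GD:    ", plain, f(*plain))
print("perturbed GD:", perturbed, f(*perturbed))
```

Under these assumptions, plain gradient descent never leaves the x-axis and stalls at the saddle, while the perturbed run leaves it and settles near one of the two minima. A smaller radius only delays the escape by a number of steps that grows logarithmically in the inverse radius, which is the intuition behind using very small perturbations.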

Paper Structure

This paper contains 43 sections, 30 theorems, 127 equations, 6 figures, and 2 algorithms.

Key Result

Theorem 1

Suppose that $f$ is ${(L, \mathcal{M}, +\infty)}$-gradient-Lipschitz and ${(\rho, \mathcal{M}, +\infty)}$-Hessian-Lipschitz with some parameters $L,\rho>0$. Then, with an overwhelming probability, the PGD algorithm with perturbation radius $\gamma = \tilde{\Theta}({\epsilon}/L)$ outputs an $\epsilon

Figures (6)

  • Figure 1: PGD applied to matrix sensing converges under exact parameterization but fails to converge with over-parameterization. In this example, the true matrix $\mathbf{\Theta}^\star$ is a $20 \times 20$ PSD matrix with rank $r = 3$. The search rank is set to $r' = 3$ and $r'=4$ for exact- and over-parameterized settings.
  • Figure 2: When perturbation is added to the gradient, it contributes to two directions: deviation direction, which steers the iterations away from the implicit region $\mathcal{M}$, and the negative Hessian direction, which enables an escape from SSPs.
  • Figure 3: (Right) The stuck and escape regions of PGD are depicted in green and red, respectively. A wide stuck region necessitates proportionally larger perturbations for escape. (Left) Our results demonstrate that the width of the stuck region can be reduced to an arbitrarily small value, at the cost of a poly-logarithmic increase in the number of iterations.
  • Figure 4: Behavior of IPGD+ on low-rank matrix recovery. The deviation rate (denoted as $r(\cdot)$) and its cumulative counterpart (shown as $\bar{r}_T$) serve as strong indicators of whether IPGD+ converges to the ground truth. A large value of $\bar{r}_T$ suggests that the residual norm grows significantly, indicating that IPGD+ is directing the iterations away from the implicit region $\mathcal{M}$. Conversely, a small value of $\bar{r}_T$ implies that the residual norm remains small, thereby keeping the iterations close to the implicit region $\mathcal{M}$.
  • Figure 5: Behavior of IPGD+ on sparse recovery. Despite the rapid growth of the residual norm during the early iterations of IPGD+, driven by the large cumulative deviation rate, the residual norm eventually stabilizes at approximately $10^{-9}$, resulting in a final solution error of the same magnitude.
  • ...and 1 more figure
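The residual norm discussed in the captions above can be tracked in a small experiment. The following is a minimal sketch under several assumptions that are not taken from the paper: it uses the dimensions of Figure 1 (a $20 \times 20$ PSD ground truth of rank 3, search rank 4), replaces the sensing operator with the identity (plain matrix factorization), runs unperturbed gradient descent from a small random initialization rather than PGD or IPGD+, and defines the "residual" as the rows of the factor that lie outside the range of the ground truth. The names `theta_star`, `residual_norm`, `alpha`, and `eta` are hypothetical.

```python
import random


def matmul(A, B):
    # Multiply two matrices stored as lists of rows.
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


d, r, r_search = 20, 3, 4              # dimension, true rank, search rank
alpha, eta, steps = 1e-3, 0.1, 600     # init scale, step size, iterations

# Assumed ground truth: diag(1, 1, 1, 0, ..., 0), a rank-3 PSD matrix.
theta_star = [[1.0 if i == j and i < r else 0.0 for j in range(d)]
              for i in range(d)]

rng = random.Random(0)
U = [[alpha * rng.gauss(0.0, 1.0) for _ in range(r_search)] for _ in range(d)]


def error_matrix(U):
    # U U^T - theta_star
    R = matmul(U, transpose(U))
    return [[R[i][j] - theta_star[i][j] for j in range(d)] for i in range(d)]


def loss(U):
    # f(U) = 1/4 * ||U U^T - theta_star||_F^2
    return 0.25 * sum(x * x for row in error_matrix(U) for x in row)


def residual_norm(U):
    # Frobenius norm of the rows of U outside the range of theta_star.
    return sum(x * x for row in U[r:] for x in row) ** 0.5


initial_loss = loss(U)
initial_residual = residual_norm(U)

for _ in range(steps):
    # Gradient of f is (U U^T - theta_star) U because the error is symmetric.
    G = matmul(error_matrix(U), U)
    U = [[U[i][k] - eta * G[i][k] for k in range(r_search)] for i in range(d)]

final_loss = loss(U)
final_residual = residual_norm(U)
print("loss:    ", initial_loss, "->", final_loss)
print("residual:", initial_residual, "->", final_residual)
```

In this simplified setting the residual rows are multiplied at each step by a matrix of spectral norm at most one, so their norm cannot grow, and the final error is governed by how small the residual was at initialization. This is only meant to make the role of the residual norm concrete; it does not reproduce the failure of PGD shown in Figure 1 or the experiments in Figures 4 and 5.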

Theorems & Definitions (63)

  • Definition 1: $(L, \mathcal{M}, \tau)$-gradient-Lipschitz
  • Definition 2: $(\rho, \mathcal{M}, \tau)$-Hessian-Lipschitz
  • Definition 3: Approximate ($\mathcal{M}$-)SOSP
  • Definition 4: Approximate SSP
  • Theorem 1: Theorem 3 of Jin et al. (2017); informal
  • Theorem 2: Convergence of IPGD to an approximate SOSP
  • Example 1: Low-rank matrix recovery
  • Example 2: Sparse recovery
  • Example 3: Models with sparse basis function decomposition
  • Definition 5: Deviation rate
  • ...and 53 more
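For orientation, the standard global notions that Definitions 1 to 3 refine can be written as follows. These are the textbook conditions used by Jin et al. (2017), not the paper's $\mathcal{M}$-restricted versions, which are not reproduced on this page and may differ in how they involve the region $\mathcal{M}$ and the parameter $\tau$.

```latex
% Standard (global) conditions; the paper's (L, M, tau) and (rho, M, tau)
% variants are not shown on this page and may differ.
\begin{align*}
  \text{gradient-Lipschitz:} \quad
    & \|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\|
      \quad \text{for all } x, y, \\
  \text{Hessian-Lipschitz:} \quad
    & \|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \, \|x - y\|
      \quad \text{for all } x, y, \\
  \epsilon\text{-SOSP:} \quad
    & \|\nabla f(x)\| \le \epsilon
      \quad \text{and} \quad
      \lambda_{\min}\!\left(\nabla^2 f(x)\right) \ge -\sqrt{\rho \, \epsilon}.
\end{align*}
```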