Understanding the Implicit Regularization of Gradient Descent in Over-parameterized Models
Jianhao Ma, Geyu Liang, Salar Fattahi
TL;DR
The paper asks why gradient descent (GD) tends to produce low-dimensional solutions in over-parameterized models, and answers with a unified framework built around an implicit region M induced by reparameterization. It shows that GD converges to a second-order stationary point within M (an M-SOSP) when three conditions hold: the initialization is near M, saddle points are escaped efficiently via infinitesimal perturbations, and the trajectory stays close to M. These conditions are formalized through a signal–residual decomposition and a deviation-rate measure. The central contribution is Infinitesimally Perturbed Gradient Descent (IPGD), which provably converges to an ε-M-SOSP with poly-logarithmic perturbation costs and improved residual control, and is shown to perform near-linearly in over-parameterized matrix sensing under mild assumptions. The theoretical results for IPGD cover Lipschitz conditions, region closure, strict-saddle and regularity properties, and a detailed application to over-parameterized matrix sensing; numerical experiments extend to low-rank recovery and sparse recovery. The work offers practical insight into implicit regularization and suggests applicability to other nonconvex, over-parameterized learning problems, with possible extensions to stochastic and nonsmooth settings.
Abstract
Implicit regularization refers to the tendency of local search algorithms to converge to low-dimensional solutions, even when such structures are not explicitly enforced. Despite its ubiquity, the mechanism underlying this behavior remains poorly understood, particularly in over-parameterized settings. We analyze gradient descent dynamics and identify three conditions under which it converges to second-order stationary points within an implicit low-dimensional region: (i) suitable initialization, (ii) efficient escape from saddle points, and (iii) sustained proximity to the region. We show that these can be achieved through infinitesimal perturbations and a small deviation rate. Building on this, we introduce Infinitesimally Perturbed Gradient Descent (IPGD), which satisfies these conditions under mild assumptions. We provide theoretical guarantees for IPGD in over-parameterized matrix sensing and empirical evidence of its broader applicability.
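The three conditions above can be illustrated on a toy over-parameterized matrix sensing problem. The sketch below is not the authors' IPGD: it is plain gradient descent on a full d x d factor U, started from a small random initialization, with a tiny isotropic Gaussian perturbation added at each step as a stand-in for the infinitesimal perturbations. The problem sizes, step size `eta`, initialization scale `alpha`, perturbation scale `sigma`, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def make_problem(d=40, r=1, m=600, seed=0):
    """Toy matrix sensing instance (all sizes are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    # Ground truth X* = V V^T with orthonormal V, so its nonzero eigenvalues are 1.
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    X_star = V @ V.T
    # Symmetric Gaussian sensing matrices A_i, stored flattened as rows.
    G = rng.standard_normal((m, d, d))
    A_flat = ((G + G.transpose(0, 2, 1)) / 2.0).reshape(m, d * d)
    y = A_flat @ X_star.ravel()  # noiseless measurements y_i = <A_i, X*>
    return A_flat, y, X_star


def loss_and_grad(U, A_flat, y):
    """f(U) = 1/(4m) * sum_i (<A_i, U U^T> - y_i)^2 and its gradient in U."""
    m, d = y.shape[0], U.shape[0]
    residual = A_flat @ (U @ U.T).ravel() - y
    loss = np.sum(residual ** 2) / (4.0 * m)
    M = (A_flat.T @ residual).reshape(d, d) / m  # (1/m) * sum_i r_i A_i
    return loss, M @ U  # valid because every A_i is symmetric


def perturbed_gd(A_flat, y, d, alpha=1e-3, eta=0.1, sigma=1e-10,
                 steps=500, seed=1):
    """GD on an over-parameterized d x d factor with tiny per-step noise.

    alpha: small initialization scale (keeps the iterate near the
           low-rank region at the start).
    sigma: scale of the per-step perturbation (hypothetical stand-in for
           an infinitesimal perturbation; not the paper's schedule).
    """
    rng = np.random.default_rng(seed)
    U = alpha * rng.standard_normal((d, d)) / np.sqrt(d)
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(U, A_flat, y)
        losses.append(loss)
        U = U - eta * grad + sigma * rng.standard_normal((d, d))
    losses.append(loss_and_grad(U, A_flat, y)[0])
    return U, losses


def run_demo():
    d = 40
    A_flat, y, X_star = make_problem(d=d)
    U, losses = perturbed_gd(A_flat, y, d)
    X_hat = U @ U.T
    return {
        "initial_loss": losses[0],
        "final_loss": losses[-1],
        "rel_err": np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star),
        # Singular values of U U^T in descending order.
        "spectrum": np.linalg.svd(X_hat, compute_uv=False),
    }


if __name__ == "__main__":
    out = run_demo()
    print("relative error:", out["rel_err"])
    print("top singular values of U U^T:", out["spectrum"][:4])
```

With these settings there are fewer measurements (600) than free entries in a symmetric 40 x 40 matrix (820), so the data alone do not pin down the solution. In this toy setup, any low-rank structure in the recovered U U^T is therefore attributable to the dynamics of the iterates rather than to an explicit rank constraint.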
