Early-Stopped Mirror Descent for Linear Regression over Convex Bodies

Tobias Wegel, Gil Kur, Patrick Rebeschini

TL;DR

This work establishes that early-stopped mirror descent (ESMD) can match the statistical performance of the constrained least squares estimator (LSE) for high-dimensional linear regression under arbitrary convex shape constraints. By designing optimization potentials from the Minkowski functional of the constraint set and analyzing ESMD through localized complexity tools like the critical radius and localized Gaussian width, the authors prove a risk bound showing ESMD is within a constant factor of the LSE’s risk, uniformly over the constraint set and for general design matrices. The framework yields sharp rates for several geometric families, including $\ell_p$-balls with $p\in[1,2)$, $M$-convex hulls, and both column-normalized and Gaussian designs, and demonstrates a transfer of minimax optimality from the LSE to ESMD. The results provide a principled, geometry-driven blueprint to achieve implicit regularization via early stopping across a broad spectrum of convex constraints, with implications for computational efficiency and statistical optimality in overparameterized regimes.

Abstract

Early-stopped iterative optimization methods are widely used as alternatives to explicit regularization, and direct comparisons between early-stopping and explicit regularization have been established for many optimization geometries. However, most analyses depend heavily on the specific properties of the optimization geometry or strong convexity of the empirical objective, and it remains unclear whether early-stopping could ever be less statistically efficient than explicit regularization for some particular shape constraint, especially in the overparameterized regime. To address this question, we study the setting of high-dimensional linear regression under additive Gaussian noise when the ground truth is assumed to lie in a known convex body and the task is to minimize the in-sample mean squared error. Our main result shows that for any convex body and any design matrix, up to an absolute constant factor, the worst-case risk of unconstrained early-stopped mirror descent with an appropriate potential is at most that of the least squares estimator constrained to the convex body. We achieve this by constructing algorithmic regularizers based on the Minkowski functional of the convex body.
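The Minkowski functional (gauge) of a convex body $K$ with the origin in its interior is $\|\alpha\|_K = \inf\{t > 0 : \alpha \in tK\}$. The sketch below is only an illustration of that definition, not the paper's construction of the potential: it approximates the gauge by bisection, assuming a membership oracle for $K$. The names `minkowski_functional`, `contains`, and `in_l1_ball` are hypothetical.

```python
from typing import Callable, Sequence


def minkowski_functional(
    alpha: Sequence[float],
    contains: Callable[[Sequence[float]], bool],
    tol: float = 1e-9,
    t_max: float = 1e12,
) -> float:
    """Approximate inf{t > 0 : alpha in t*K} by bisection.

    Assumes K is convex, bounded, and has the origin in its interior,
    and that `contains(v)` answers whether v lies in K (hypothetical oracle).
    """
    if all(a == 0.0 for a in alpha):
        return 0.0
    # Grow an upper bound hi until alpha / hi lies in K.
    hi = 1.0
    while not contains([a / hi for a in alpha]):
        hi *= 2.0
        if hi > t_max:
            raise ValueError("no scaling found; is the origin interior to K?")
    lo = 0.0
    # Invariant: alpha / hi is in K, and alpha / lo is not (or lo == 0).
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if contains([a / mid for a in alpha]):
            hi = mid
        else:
            lo = mid
    return hi


def in_l1_ball(v: Sequence[float], radius: float = 2.0) -> bool:
    """Membership oracle for the l1-ball of the given radius (example body)."""
    return sum(abs(x) for x in v) <= radius


# For an l1-ball of radius 2, the gauge is ||alpha||_1 / 2.
gauge = minkowski_functional([3.0, -1.0], in_l1_ball)
print(f"gauge of (3, -1): {gauge:.6f}")
```

For norm balls the gauge has a closed form, so bisection is only useful when $K$ is available through a membership test alone.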

Paper Structure

This paper contains 75 sections, 13 theorems, 170 equations, 3 figures, 1 table.

Key Result

Theorem 1

For any convex body $K \subset \mathbb{R}^d$ and any design matrix $\mathbf{X} \in\mathbb{R}^{n\times d}$, in both continuous and discrete time, there exist a strongly convex optimization potential $\psi$ and a stopping time $T$ of unconstrained mirror descent such that the in-sample risk of ESMD obeys a risk bound that holds with high probability over draws of the noise $\xi$.
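For orientation, the objects named in the theorem can be written out as follows. This is a sketch under assumed notation: the $1/n$ normalization of the loss, the step size $\eta$, and the symbol $\mathcal{R}$ for the in-sample risk are assumptions rather than taken from the source, and the Minkowski functional is stated for a convex body $K$ with the origin in its interior.

```latex
\begin{align*}
  y &= \mathbf{X}\alpha^\star + \xi, \qquad \alpha^\star \in K
    && \text{(linear model with Gaussian noise } \xi\text{)} \\
  \|\alpha\|_K &= \inf\{\, t > 0 : \alpha \in tK \,\}
    && \text{(Minkowski functional of } K\text{)} \\
  \nabla\psi(\alpha_{t+1}) &= \nabla\psi(\alpha_t)
    + \frac{\eta}{n}\,\mathbf{X}^\top\bigl(y - \mathbf{X}\alpha_t\bigr)
    && \text{(discrete-time unconstrained mirror descent)} \\
  \mathcal{R}(\alpha_T) &= \frac{1}{n}\,\bigl\|\mathbf{X}(\alpha_T - \alpha^\star)\bigr\|_2^2
    && \text{(in-sample risk at the stopping time } T\text{)}
\end{align*}
```

The update is a gradient step on $\frac{1}{2n}\|y - \mathbf{X}\alpha\|_2^2$ taken in the dual coordinates $\nabla\psi(\alpha)$, which is why the choice of $\psi$ determines the path along which early stopping acts as a regularizer.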

Figures (3)

  • Figure 1: We plot an $M$-convex hull, a level set of the potential from \ref{eq:M-convex-hull-potential} with $\gamma=10$ and $\rho = 0.2$, the Bregman ball from \ref{eq:inclusion}, the set of points satisfying the offset condition \ref{eq:offset-condition}, and the mirror descent path that converges to $\mathop{\mathrm{arg\,min}}\limits_{\mathbf{X}\alpha=y}\psi(\alpha)$ (Gunasekar et al., 2018). As the plot shows, there is a time $t^\star$ at which $\left\|\alpha_{t^\star}-\alpha^\star\right\|_2\leq \left\|\widehat{\alpha}_{{\mathop{\mathrm{LSE}}\nolimits}}-\alpha^\star\right\|_2$.
  • Figure 2: Simulation results for \ref{thm:minimax-rate-1-2}. For each $p$, we plot the log of the average in-sample risk over 20 experiments as a solid line, and the log of the $20$th and $80$th quantiles as the shaded region. The dashed lines show $(1/2-1/p)\log n$ for each $p$.
  • Figure 3: Optimization paths of different known and new potentials from this work for regression over $\ell_1$-balls. The squared hypentropy "fixes" issues arising for the previously used hypentropy of Ghai et al. (2020). \ref{fig:iterate-paths}: For a two-dimensional problem, we plot paths on one data instance. Note that the paths can (somewhat) deviate from the LASSO path, but early stopping still achieves minimax rates (\ref{subsec:l1-constraints}). \ref{fig:risk-along-path}: For $n=d=100$, we take $\mathbf{X}$ to be Gaussian and $\alpha^\star$ to be 1-sparse. We repeat the experiment 50 times and plot the mean and the $10$th and $90$th quantiles.
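The kind of experiment described for Figure 3 can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it runs unconstrained mirror descent with the hypentropy potential of Ghai et al. (2020), whose mirror map is $\nabla\psi(\alpha) = \operatorname{arcsinh}(\alpha/\beta)$, rather than the new squared hypentropy from the paper. The problem size ($n=20$, $d=40$), the noise level, the step size, $\beta$, and all function names are assumptions chosen for a quick pure-Python run, and the reported stopping time is an oracle choice that uses $\alpha^\star$.

```python
import math
import random


def matvec(X, v):
    """Return X @ v for a matrix stored as a list of rows."""
    return [sum(x * vj for x, vj in zip(row, v)) for row in X]


def rmatvec(X, r):
    """Return X^T @ r."""
    return [sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(X[0]))]


def in_sample_risk(X, alpha, alpha_star):
    """Compute (1/n) * ||X (alpha - alpha_star)||_2^2."""
    diff = matvec(X, [a - s for a, s in zip(alpha, alpha_star)])
    return sum(v * v for v in diff) / len(X)


def hypentropy_md_path(X, y, beta=0.1, step=0.05, n_steps=200):
    """Mirror descent on (1/(2n)) * ||y - X alpha||^2 with the hypentropy potential.

    The dual iterate is theta = arcsinh(alpha / beta), so alpha = beta * sinh(theta).
    Returns the list of primal iterates alpha_0, ..., alpha_{n_steps}.
    """
    n, d = len(X), len(X[0])
    theta = [0.0] * d  # corresponds to alpha_0 = 0
    path = []
    for _ in range(n_steps + 1):
        alpha = [beta * math.sinh(t) for t in theta]
        path.append(alpha)
        resid = [yi - pi for yi, pi in zip(y, matvec(X, alpha))]
        grad = [-g / n for g in rmatvec(X, resid)]  # gradient of the loss
        theta = [t - step * g for t, g in zip(theta, grad)]  # step in dual space
    return path


# Synthetic instance (assumed sizes): Gaussian design, 1-sparse ground truth.
random.seed(0)
n, d, sigma = 20, 40, 0.1
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
alpha_star = [1.0] + [0.0] * (d - 1)
y = [m + sigma * random.gauss(0.0, 1.0) for m in matvec(X, alpha_star)]

path = hypentropy_md_path(X, y)
risks = [in_sample_risk(X, a, alpha_star) for a in path]
# Oracle stopping time: uses alpha_star, so it is for illustration only.
t_best = min(range(len(risks)), key=risks.__getitem__)
print(f"risk at t=0: {risks[0]:.3f}, best risk {risks[t_best]:.3f} at t={t_best}")
```

Plotting `risks` against the iteration index gives a risk-along-path curve for a single instance; averaging over repeated draws of the noise would be needed to mimic the quantile bands mentioned in the caption.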

Theorems & Definitions (36)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem (informal version of \ref{cor:comparison-LSE})
  • Definition 4
  • Definition 5
  • Definition 6
  • Lemma 1
  • Theorem 1
  • Corollary 1
  • ...and 26 more