Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Dan Qiao, Kaiqi Zhang, Esha Singh, Daniel Soudry, Yu-Xiang Wang

TL;DR

This work analyzes generalization of two-layer ReLU networks in 1D nonparametric regression with noisy labels, showing that large gradient-descent step sizes bias training toward stable, simple minima rather than interpolation. The authors develop a function-space theory linking GD stability to a weighted TV$^{(1)}$ constraint, deriving generalization bounds and a near-minimax MSE rate for BV$^{(1)}$ targets, all without explicit regularization. They demonstrate that stable minima cannot interpolate noisy data and that, inside the data support, the generalization gap vanishes as sample size grows, with learning-rate tuning acting as an implicit regularizer. Empirical results corroborate the theory, showing large learning rates yield sparse linear-spline representations and improved generalization, offering a non-kernel pathway to near-optimal rates in nonparametric regression.

Abstract

We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (e.g., NTK) are provably sub-optimal and benign overfitting does not happen, thus disqualifying existing theory for interpolating (0-loss, globally optimal) solutions. We present a new theory of generalization for local minima that gradient descent with a constant learning rate can stably converge to. We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions with a certain weighted first-order total variation bounded by $1/\eta - 1/2 + \widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$, where $\sigma$ is the label noise level, $\mathrm{MSE}$ is short for mean squared error against the ground truth, and $\widetilde{O}(\cdot)$ hides a logarithmic factor. Under mild assumptions, we also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points. Our theoretical results are validated by extensive simulation demonstrating that large learning rate training induces sparse linear spline fits. To the best of our knowledge, we are the first to obtain a generalization bound via minima stability in the non-interpolation case and the first to show that ReLU NNs without regularization can achieve near-optimal rates in nonparametric regression.
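
As a rough worked example, the leading term $1/\eta - 1/2$ of the total-variation bound can be evaluated for two illustrative step sizes, chosen here to match the values $\eta = 0.4$ and $\eta = 0.01$ reported for the Figure 3 simulation. The $\widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$ term is not quantified in the abstract, so it is left out of the arithmetic:

$$
\eta = 0.4:\quad \frac{1}{0.4} - \frac{1}{2} = 2, \qquad\qquad \eta = 0.01:\quad \frac{1}{0.01} - \frac{1}{2} = 99.5 .
$$

Read this way, the leading term permits roughly fifty times more weighted total variation at the small step size than at the large one, which is consistent with the claim that large step sizes restrict gradient descent to simpler functions.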

Paper Structure (37 sections, 32 theorems, 135 equations, 9 figures)

Key Result

Lemma 2.2

Consider the update rule in Definition 2.1 (linear stability). For any $\epsilon>0$, a local minimum $\theta^\star$ is an $\epsilon$-linearly stable minimum of $\mathcal{L}$ if and only if $\lambda_{\max}(\nabla^2\mathcal{L}(\theta^\star))\leq\frac{2}{\eta}$.
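
The threshold $2/\eta$ in the lemma can be illustrated on a toy problem. The sketch below is not the paper's setting: it assumes a one-dimensional quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\lambda\theta^2$, whose Hessian is the single number $\lambda$, and runs plain gradient descent on it. The function names `gd_on_quadratic` and `is_linearly_stable` are hypothetical and introduced only for illustration.

```python
def gd_on_quadratic(curvature, step_size, theta0=1.0, num_steps=200):
    """Run gradient descent on L(theta) = 0.5 * curvature * theta**2.

    Toy assumption: a 1D quadratic, so the Hessian equals `curvature`
    and each step multiplies theta by (1 - step_size * curvature).
    """
    theta = theta0
    for _ in range(num_steps):
        grad = curvature * theta          # dL/dtheta for the quadratic
        theta = theta - step_size * grad  # constant-step-size GD update
    return theta


def is_linearly_stable(curvature, step_size):
    """Check the lemma's condition lambda_max <= 2 / eta for the 1D toy."""
    return curvature <= 2.0 / step_size


if __name__ == "__main__":
    eta = 0.4  # threshold is 2 / eta = 5
    for lam in (4.0, 6.0):
        final = gd_on_quadratic(lam, eta)
        print(lam, is_linearly_stable(lam, eta), abs(final))
```

With $\eta = 0.4$, curvature 4 sits below the threshold of 5 and the iterates shrink toward the minimum, while curvature 6 sits above it and the iterates grow geometrically, so gradient descent cannot settle at such a minimum.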

Figures (9)

  • Figure 1: We show that "Large step size selects simple functions that generalize."
  • Figure 2: Empirical evidence of our claim. Two-layer ReLU neural networks trained by gradient descent with a constant step size generalize because of minima stability. The left panel shows that, as the step size increases, gradient descent finds smoother solutions (linear splines) with fewer knots. The middle panel illustrates our theoretical result: $1/\eta + O(1)$ is a numerically accurate upper bound on the curvature and the TV1-complexity of the smooth solution. The right panel shows that tuning $\eta$ gives the classical U-shaped bias-variance tradeoff for overparameterized NNs.
  • Figure 3: Highlights of our numerical simulation for large step size ($\eta=0.4$, first row) and small step size ($\eta=0.01$, second row) gradient descent training of a univariate ReLU NN with $n=30$ noisy observations and $k=100$ hidden neurons. From left to right, the three columns illustrate (a) Trained NN function (b) Learning curves (c) Learned basis functions (each of the 100 neurons).
  • Figure 4: Illustration of the solutions gradient descent with learning rate $\eta$ converges to (Part I). As $\eta$ decreases, the fitted function goes from simple to complex. Any line below the $\sigma^2$ line satisfies the "optimized" assumption from Corollary \ref{cor:tv1} and Theorem \ref{thm:under}.
  • Figure 5: Illustration of the solutions gradient descent with learning rate $\eta$ converges to (Part II). As $\eta$ decreases further, the fitted function starts to overfit to the noisy label.
  • ...and 4 more figures
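
The captions above describe the trained networks as linear splines and compare them by number of knots and TV1-complexity. The sketch below shows one way these two quantities could be read off the weights of a univariate two-layer ReLU network $f(x) = \sum_i a_i\,\mathrm{ReLU}(w_i x + b_i) + c$. It is an illustration under assumptions, not the paper's procedure: it computes the unweighted total variation of $f'$ (the paper's bound concerns a weighted version), and the function name `knots_and_tv1` and its parameters are hypothetical.

```python
from collections import defaultdict


def knots_and_tv1(outer_weights, inner_weights, biases, ndigits=12):
    """Return (sorted knot locations, unweighted TV of f') for
    f(x) = sum_i a_i * relu(w_i * x + b_i) + c.

    Neuron i kinks at x = -b_i / w_i, where the slope f' jumps by
    a_i * |w_i|. Neurons sharing a knot have their jumps added, so
    jumps that cancel exactly do not count as a knot.
    """
    jumps = defaultdict(float)
    for a, w, b in zip(outer_weights, inner_weights, biases):
        if a == 0.0 or w == 0.0:
            continue  # neuron is constant in x: no kink
        knot = round(-b / w, ndigits)  # merge numerically equal knots
        jumps[knot] += a * abs(w)
    active = {k: j for k, j in jumps.items() if abs(j) > 1e-12}
    tv1 = sum(abs(j) for j in active.values())
    return sorted(active), tv1


if __name__ == "__main__":
    # Three neurons; the last two share the knot x = 1.
    knots, tv1 = knots_and_tv1([1.0, -2.0, 0.5], [1.0, 2.0, -1.0], [0.0, -2.0, 1.0])
    print(knots, tv1)
```

In this example the slope jumps by 1 at $x = 0$ and by $-4 + 0.5 = -3.5$ at $x = 1$, giving two knots and an unweighted TV1 of 4.5. A sparser spline in the sense of the captions would show fewer active knots and a smaller total.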

Theorems & Definitions (62)

  • Definition 2.1: Linear stability
  • Lemma 2.2
  • Definition 2.3: Below Edge of Stability
  • Theorem 3.1
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Lemma B.1: Restatement of Lemma \ref{lem:stable}
  • Proof: Proof of Lemma \ref{lem:restable}
  • ...and 52 more