Benign Overfitting for Regression with Trained Two-Layer ReLU Networks

Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan

TL;DR

The paper analyzes square-loss regression with a finite-width two-layer ReLU network trained by gradient flow in the neural tangent kernel (NTK) regime. It decomposes the excess risk into approximation and estimation errors, treating gradient flow as an implicit regularizer, and thereby obtains generalization guarantees for arbitrary bounded regression functions and bounded noise without relying on uniform convergence. The authors prove a main generalization bound, $R(\hat{f}_{T_\epsilon}) - R(f^\star) \leq \epsilon$, together with a benign overfitting result: under the same high-probability conditions, both the empirical risk and the excess risk vanish. The work extends benign overfitting theory to non-smooth, finite-width neural networks, and offers theoretical insight into how overparameterized networks can fit the training data yet still generalize well through implicit regularization in the NTK regime.
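
The decomposition mentioned above can be sketched in two standard steps. This is only an illustration under assumed notation, not the paper's exact statement: $f^\star$ is taken to be the regression function $x \mapsto \mathbb{E}[Y \mid X = x]$, $\hat{f}_T$ the network trained on the sample up to time $T$, $f_T$ a population-level counterpart, and $\lVert\cdot\rVert_2$ the $L^2$ norm with respect to the input distribution. The paper's own constants and normalization may differ.

```latex
% Assumed notation (hypothetical, for illustration only):
%   f^\star(x) = E[Y | X = x],  \hat{f}_T trained on the sample,
%   f_T its population-level counterpart,
%   \lVert\cdot\rVert_2 the L^2 norm under the input distribution.
\begin{align*}
R(f) - R(f^\star)
  &= \lVert f - f^\star \rVert_2^2
  && \text{(square loss, } f^\star \text{ the regression function)} \\
\lVert \hat{f}_T - f^\star \rVert_2
  &\leq \underbrace{\lVert \hat{f}_T - f_T \rVert_2}_{\text{estimation error}}
      + \underbrace{\lVert f_T - f^\star \rVert_2}_{\text{approximation error}}
  && \text{(triangle inequality)}
\end{align*}
```

If each of the two terms on the right is at most $\epsilon/2$, the left-hand side is at most $\epsilon$, which is why the approximation and estimation errors can be controlled separately.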

Abstract

We study the least-squares regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow. Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded. We operate in the neural tangent kernel regime, and our generalization result is developed via a decomposition of the excess risk into estimation and approximation errors, viewing gradient flow as an implicit regularizer. This decomposition in the context of neural networks is a novel perspective on gradient descent, and helps us avoid uniform convergence traps. In this work, we also establish that, under the same setting, the trained network overfits to the data. Together, these results establish the first result on benign overfitting for finite-width ReLU networks for arbitrary regression functions.
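
The setting described in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation. It assumes a common NTK-style parameterization, $f(x) = \frac{1}{\sqrt{m}} \sum_j a_j \max(\langle w_j, x\rangle, 0)$, with fixed random output signs $a_j$ and only the first layer trained, and it approximates gradient flow by small explicit Euler steps (plain gradient descent). All function names, the target function, and the hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def init_network(d, m, rng):
    """Gaussian first-layer weights W (m x d) and fixed output signs a in {-1, +1}."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a


def predict(W, a, x):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(<w_j, x>)  (assumed NTK scaling)."""
    total = 0.0
    for w_j, a_j in zip(W, a):
        pre = sum(w * xi for w, xi in zip(w_j, x))
        total += a_j * max(pre, 0.0)
    return total / math.sqrt(len(W))


def empirical_risk(W, a, X, y):
    """Mean squared error over the training sample."""
    return sum((predict(W, a, x) - yi) ** 2 for x, yi in zip(X, y)) / len(X)


def gradient_step(W, a, X, y, lr):
    """One explicit Euler step of gradient flow on the empirical square loss.

    Only the first layer W is updated; the output signs a stay fixed.
    """
    m, d, n = len(W), len(W[0]), len(X)
    grad = [[0.0] * d for _ in range(m)]
    for x, yi in zip(X, y):
        residual = predict(W, a, x) - yi
        for j in range(m):
            pre = sum(w * xi for w, xi in zip(W[j], x))
            if pre > 0.0:  # ReLU (sub)gradient is 1 on the active side, 0 otherwise
                coeff = 2.0 * residual * a[j] / (math.sqrt(m) * n)
                for k in range(d):
                    grad[j][k] += coeff * x[k]
    return [[W[j][k] - lr * grad[j][k] for k in range(d)] for j in range(m)]


def run_demo(n=10, d=5, m=64, steps=200, lr=0.2, seed=0):
    """Train on noisy samples of a bounded target; return the empirical-risk history."""
    rng = random.Random(seed)
    X = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in v))
        X.append([t / norm for t in v])  # inputs on the unit sphere
    # Bounded (hypothetical) regression function plus bounded noise.
    y = [math.sin(3.0 * x[0]) + rng.uniform(-0.1, 0.1) for x in X]
    W, a = init_network(d, m, rng)
    history = [empirical_risk(W, a, X, y)]
    for _ in range(steps):
        W = gradient_step(W, a, X, y, lr)
        history.append(empirical_risk(W, a, X, y))
    return history


if __name__ == "__main__":
    risks = run_demo()
    print(f"empirical risk: start={risks[0]:.4f}, end={risks[-1]:.4f}")
```

Running the script prints the empirical risk before and after training. The sketch only illustrates the training setup; it does not reproduce the paper's conditions on the input dimension, the width, or the stopping time, and it says nothing by itself about the excess risk.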

Paper Structure

This paper contains 28 sections, 22 theorems, 132 equations, and 3 tables.

Key Result

Theorem 1

For any $\epsilon > 0$ and $\delta > 0$, as long as both the input dimension ($d$) and the network width ($m$) are large enough, there exists some time $T$ such that with probability at least $1-\delta$, the approximation error is bounded as $\lVert f_T-f^\star\rVert_2\leq \epsilon/2$. Here, $f_T$ i

Theorems & Definitions (25)

  • Theorem 1: Approximation Error, Informal
  • Theorem 2: Estimation Error, Informal
  • Theorem 3: Overfitting, Informal
  • Theorem 4: Benign Overfitting, Informal
  • Definition 5: $\lambda_\epsilon$
  • Theorem 6: Approximation Error
  • Theorem 7: Estimation Error
  • Theorem 8: Generalization
  • Theorem 9: Overfitting
  • Theorem 10: Benign Overfitting
  • ...and 15 more