Benign Overfitting for Regression with Trained Two-Layer ReLU Networks

Junhyung Park; Patrick Bloebaum; Shiva Prasad Kasiviswanathan

TL;DR

The paper analyzes square-loss regression using a finite-width two-layer ReLU network trained by gradient flow in the neural tangent kernel (NTK) regime. It introduces an approximation-estimation decomposition of the excess risk, treating gradient flow as an implicit regularizer to obtain generalization guarantees for arbitrary bounded regression functions and bounded noise without requiring uniform convergence. The authors prove a main generalization bound $R(\hat{f}_{T_\epsilon}) - R(f^\star) \leq \epsilon$ and a concomitant benign overfitting result, showing both vanishing empirical and excess risks under the same high-probability conditions. This work extends benign overfitting theory to non-smooth, finite-width neural networks, offering theoretical insight into how overparameterized networks can fit training data yet generalize well via implicit regularization in the NTK regime.
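
The approximation-estimation decomposition can be sketched with a triangle inequality. The notation here is an assumption based on the theorem names listed on this page: $\hat{f}_T$ denotes the network trained on the sample up to time $T$, $f_T$ a reference predictor at the same time, and $\lVert\cdot\rVert_2$ the $L^2$ norm with respect to the input distribution. The excerpt does not show the paper's exact definitions or constants, so this is an illustration rather than the paper's statement.

```latex
\[
\lVert \hat{f}_T - f^\star \rVert_2
\;\le\;
\underbrace{\lVert \hat{f}_T - f_T \rVert_2}_{\text{estimation error}}
\;+\;
\underbrace{\lVert f_T - f^\star \rVert_2}_{\text{approximation error}}
\]
```

If each term on the right is at most $\epsilon/2$, the left-hand side is at most $\epsilon$. For the square loss with $f^\star$ the regression function (the conditional mean of the response), the excess risk $R(f) - R(f^\star)$ equals $\lVert f - f^\star \rVert_2^2$, which is how a bound on this distance translates into a bound on the excess risk.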

Abstract

We study the least-squares regression problem with a two-layer fully-connected neural network with ReLU activation, trained by gradient flow. Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded. We operate in the neural tangent kernel regime, and our generalization result is developed via a decomposition of the excess risk into estimation and approximation errors, viewing gradient flow as an implicit regularizer. This decomposition in the context of neural networks is a novel perspective on gradient descent, and helps us avoid uniform convergence traps. We also establish that, under the same setting, the trained network overfits to the data. Together, these results establish the first result on benign overfitting for finite-width ReLU networks for arbitrary regression functions.
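
The setting in the abstract can be illustrated with a small NumPy sketch: a two-layer ReLU network fitted to bounded targets with bounded noise under the square loss. Everything specific here is an assumption made for illustration and is not taken from the paper: the $1/\sqrt{m}$ output scaling, Gaussian first-layer initialization, fixed random-sign output weights, training only the first layer, inputs on the unit sphere, and small-step gradient descent as a stand-in for gradient flow.

```python
import numpy as np

def init_params(d, m, rng):
    # Hypothetical initialization: Gaussian hidden weights, fixed +/-1 output weights.
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def predict(W, a, X):
    # f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def empirical_risk(W, a, X, y):
    # Mean squared error on the sample.
    return float(np.mean((predict(W, a, X) - y) ** 2))

def grad_W(W, a, X, y):
    # Gradient of the empirical risk with respect to the hidden weights W.
    n, m = X.shape[0], W.shape[0]
    pre = X @ W.T                              # pre-activations, shape (n, m)
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    active = (pre > 0).astype(float)           # ReLU derivative, shape (n, m)
    weighted = active * resid[:, None] * a[None, :] / np.sqrt(m)
    return (2.0 / n) * weighted.T @ X          # shape (m, d)

def train(W, a, X, y, step=0.5, n_steps=400):
    # Small-step gradient descent, used as a discretization of gradient flow.
    W = W.copy()
    for _ in range(n_steps):
        W -= step * grad_W(W, a, X, y)
    return W

rng = np.random.default_rng(0)
n, d, m = 20, 50, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # inputs on the unit sphere
y = np.sin(3.0 * X[:, 0]) + rng.uniform(-0.1, 0.1, n)     # bounded target plus bounded noise

W0, a = init_params(d, m, rng)
risk_before = empirical_risk(W0, a, X, y)
W_trained = train(W0, a, X, y)
risk_after = empirical_risk(W_trained, a, X, y)
print(risk_before, risk_after)
```

With many more hidden units than samples, the empirical risk in this toy run drops well below its initial value, which mirrors the overfitting half of the story. Whether the fitted network also generalizes is what the paper's theorems address, and this sketch does not test it.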

Paper Structure (28 sections, 22 theorems, 132 equations, 3 tables)

Introduction
Related Works
Preliminaries
Model: Two-layer Fully-Connected Network with ReLU Activation
Assumptions on Parameters
Generalization Result
Benign Overfitting
Conclusion
Index of Notations
Additional Preliminaries
Vectors and Matrices
Standard Distributions and Concentration Results
Functions and Operators
Real Induction
U- and V-Statistics
...and 13 more sections

Key Result

Theorem 1

For any $\epsilon > 0$ and $\delta > 0$, as long as both the input dimension ($d$) and the network width ($m$) are large enough, there exists some time $T$ such that with probability at least $1-\delta$, the approximation error is bounded as $\lVert f_T-f^\star\rVert_2\leq \epsilon/2$. Here, $f_T$ i

Theorems & Definitions (25)

Theorem 1: Approximation Error, Informal
Theorem 2: Estimation Error, Informal
Theorem 3: Overfitting, Informal
Theorem 4: Benign Overfitting, Informal
Definition 5: $\lambda_\epsilon$
Theorem 6: Approximation Error
Theorem 7: Estimation Error
Theorem 8: Generalization
Theorem 9: Overfitting
Theorem 10: Benign Overfitting
...and 15 more
