Benign Overfitting for Regression with Trained Two-Layer ReLU Networks
Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan
TL;DR
The paper analyzes square-loss regression with a finite-width two-layer ReLU network trained by gradient flow in the neural tangent kernel (NTK) regime. It introduces an approximation-estimation decomposition of the excess risk that treats gradient flow as an implicit regularizer, yielding generalization guarantees for arbitrary bounded regression functions and bounded noise without relying on uniform convergence. The authors prove a main generalization bound, $R(\hat{f}_{T_\varepsilon}) - R(f^*) \le \varepsilon$, together with an accompanying benign overfitting result: under the same high-probability conditions, both the empirical risk and the excess risk vanish. The work extends benign overfitting theory to non-smooth, finite-width neural networks and offers theoretical insight into how overparameterized networks can fit the training data yet still generalize well through implicit regularization in the NTK regime.
Abstract
We study the least-squares regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow. Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded. We operate in the neural tangent kernel regime, and our generalization result is developed via a decomposition of the excess risk into estimation and approximation errors, viewing gradient flow as an implicit regularizer. This decomposition, in the context of neural networks, is a novel perspective on gradient descent and helps us avoid the pitfalls of uniform convergence. We also establish that, in the same setting, the trained network overfits to the data. Together, these results establish the first result on benign overfitting for finite-width ReLU networks for arbitrary regression functions.
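
To make the setting concrete, the following is a minimal NumPy sketch of a two-layer ReLU network trained on the square loss by small-step gradient descent, used here as an Euler discretization of gradient flow. It is an illustration under stated assumptions, not the authors' construction: the $1/\sqrt{m}$ output scaling, the fixed $\pm 1$ output weights, the Gaussian initialization, the unit-sphere inputs, the sine target, and all names and hyperparameters (`init_network`, `train`, `step`, `n_steps`) are hypothetical choices. The target and the noise are bounded, matching the only assumptions the abstract places on them.

```python
import numpy as np

def init_network(d, m, rng):
    """Gaussian hidden weights (trained) and fixed +/-1 output weights (assumed setup)."""
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def predict(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def mse(W, a, X, y):
    """Empirical square-loss risk on the sample (X, y)."""
    return float(np.mean((predict(W, a, X) - y) ** 2))

def train(W, a, X, y, step=1.0, n_steps=300):
    """Gradient descent on (1/2n) * sum_i (f(x_i) - y_i)^2 over hidden weights only."""
    n, m = X.shape[0], W.shape[0]
    W = W.copy()
    for _ in range(n_steps):
        pre = X @ W.T                                   # (n, m) pre-activations
        resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
        active = (pre > 0).astype(float)                # ReLU (sub)derivative
        grad = (resid[:, None] * active * a[None, :]).T @ X / (n * np.sqrt(m))
        W -= step * grad
    return W

rng = np.random.default_rng(0)
n, d, m = 20, 5, 1000                                   # m >> n: overparameterized
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)           # inputs on the unit sphere
y = np.sin(3.0 * X[:, 0]) + rng.uniform(-0.1, 0.1, size=n)  # bounded target + bounded noise

W0, a = init_network(d, m, rng)
W_T = train(W0, a, X, y)
mse_before = mse(W0, a, X, y)
mse_after = mse(W_T, a, X, y)
rel_move = float(np.linalg.norm(W_T - W0) / np.linalg.norm(W0))
print(f"train MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
print(f"relative weight movement: {rel_move:.4f}")
```

In a sketch like this, one would expect the training error to fall while the hidden weights move only a small relative distance from initialization, which is the behavior usually associated with the NTK regime. Evaluating `mse` on a fresh sample drawn the same way gives a rough empirical view of the excess risk, although nothing here reproduces the paper's bounds.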
