Architecture independent generalization bounds for overparametrized deep ReLU networks

Anandatheertha Bapu, Thomas Chen, Chun-Kai Kevin Chien, Patricia Muñoz Ewald, Andrew G. Moore

TL;DR

The paper addresses the puzzling generalization behavior of overparametrized deep ReLU networks by deriving architecture-independent generalization bounds that depend only on the metric geometry of the data, the regularity of the activation function, and the norms of the weights and biases. It shows that zero-loss minimizers can be constructed explicitly in the strongly overparametrized regime, where the training sample size is bounded by the input space dimension, and proves a uniform generalization bound that is independent of depth and width, relying on a Lipschitz-continuous activation and a data-driven Chamfer-distance bound. The key contributions are a priori and generalization bounds tied to data geometry, the construction of zero-loss minimizers for ReLU networks, and a detailed comparison with VC-based probabilistic bounds, complemented by MNIST experiments in which the bound agrees with the true test error to within a 22% margin on average. The work gives a data-geometry-centric explanation for generalization in the overparametrized regime and suggests that generalization can be understood and controlled through data structure and weight norms.
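
To make the two ingredients named above concrete, here is a minimal Python sketch. It assumes a one-sided Chamfer-type distance defined as the average, over test points, of the Euclidean distance to the nearest training point; the paper's exact definition and normalization may differ. It also uses the standard fact that a network whose activation is 1-Lipschitz (as ReLU is) has a Lipschitz constant no larger than the product of the operator norms of its weight matrices. The function names `chamfer_one_sided` and `lipschitz_upper_bound` are hypothetical and not taken from the paper.

```python
import math


def chamfer_one_sided(test_points, train_points):
    """Average distance from each test point to its nearest training point.

    Assumed convention; the paper's definition may be normalized differently.
    """
    total = 0.0
    for x in test_points:
        total += min(math.dist(x, z) for z in train_points)
    return total / len(test_points)


def lipschitz_upper_bound(operator_norms):
    """Product of per-layer operator norms.

    Upper-bounds the Lipschitz constant of a feed-forward network whose
    activation is 1-Lipschitz (for example ReLU).
    """
    return math.prod(operator_norms)


# Toy data: two training points and two test points in the plane.
train = [(0.0, 0.0), (1.0, 0.0)]
test = [(0.0, 1.0), (1.0, 2.0)]

distance = chamfer_one_sided(test, train)          # (1.0 + 2.0) / 2 = 1.5
lipschitz = lipschitz_upper_bound([2.0, 0.5, 3.0])  # 3.0
print(distance, lipschitz)
```

Neither quantity involves the number of layers or neurons directly, which is the sense in which a bound built from them can be architecture independent.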

Abstract

We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero-loss minimizers without use of gradient descent, and prove a uniform generalization bound that is independent of the network architecture. We test our theoretical results in computational experiments on MNIST, and obtain agreement with the true test error within a 22% margin on average.
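
The claim that zero-loss minimizers exist without gradient descent can be illustrated with a standard interpolation argument. The sketch below is not the paper's construction; it is a simplified shallow example under the assumption that the $n$ training inputs are linearly independent (which requires $n$ to be at most the input dimension). It picks a linear map $A$ with $A x_i = y_i$ via the Gram matrix, then chooses a bias large enough that ReLU acts as the identity on the training data. All names (`solve`, `build_network`, `shift`) are hypothetical.

```python
def solve(G, B):
    """Solve G C = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    M = [list(G[i]) + list(B[i]) for i in range(n)]  # augmented matrix [G | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("training inputs are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [[M[i][n + k] / M[i][i] for k in range(len(B[0]))] for i in range(n)]


def build_network(X, Y):
    """Return f(x) = A * relu(x + shift) + b2 with f(x_i) = y_i on the training set."""
    n, d, q = len(X), len(X[0]), len(Y[0])
    # Gram matrix G = X X^T; A = Y^T G^{-1} X satisfies A x_i = y_i.
    G = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    C = solve(G, Y)
    A = [[sum(C[i][k] * X[i][j] for i in range(n)) for j in range(d)]
         for k in range(q)]
    # First layer: identity weights, constant bias making every training
    # pre-activation nonnegative, so ReLU is the identity on the training data.
    shift = max(abs(t) for x in X for t in x)
    # Output bias removes the contribution of the shift: b2 = -shift * A * ones.
    b2 = [-shift * sum(row) for row in A]

    def f(x):
        hidden = [max(0.0, t + shift) for t in x]
        return [sum(a * h for a, h in zip(row, hidden)) + c
                for row, c in zip(A, b2)]

    return f


# Two training samples in R^3 with one-hot labels in R^2 (n = 2 <= 3).
X = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
Y = [(1.0, 0.0), (0.0, 1.0)]
f = build_network(X, Y)
train_loss = sum((p - t) ** 2 for x, y in zip(X, Y) for p, t in zip(f(x), y))
print(train_loss)
```

The training loss vanishes up to floating-point rounding, and no iterative optimization is involved. Away from the training points the network is still nonlinear, since ReLU clips coordinates below `-shift`.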

Paper Structure

This paper contains 17 sections, 4 theorems, 83 equations, 3 figures, 1 table.

Key Result

Proposition 1.2

The loss discrepancy satisfies the a priori bound, where $R$ is the radius of the smallest ball centered at the origin in ${\mathbb R}^{M_0}\times{\mathbb R}^Q$ containing ${\mathcal{S}}^{test}\cup{\mathcal{S}}^{train}$, and ${\rm diam}({\mathcal{S}}^{test}\cup{\mathcal{S}}^{train})$ is the diameter of ${\mathcal{S}}^{test}\cup{\mathcal{S}}^{train}$; the bound depends on the operator norms of the trained wei
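
The two geometric quantities in this statement are straightforward to compute for a finite sample. The following Python sketch assumes each sample is stored as a single flat tuple holding the input in ${\mathbb R}^{M_0}$ followed by the label in ${\mathbb R}^Q$, and uses the Euclidean norm; both choices are assumptions made for illustration, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def radius_about_origin(points):
    """Radius R of the smallest origin-centered ball containing all points."""
    return max(math.hypot(*p) for p in points)


def diameter(points):
    """Largest pairwise Euclidean distance within the point set."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


# Union of test and training samples, each written as (x_1, x_2, y).
samples = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0), (-3.0, -4.0, 0.0)]

R = radius_about_origin(samples)  # 5.0
diam = diameter(samples)          # 10.0
print(R, diam)
```

Because the ball is centered at the origin, the triangle inequality gives ${\rm diam} \le 2R$ for any point set; the toy data above attains equality.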

Figures (3)

  • Figure 1: Each plot shows the test error $\mathcal{E}^{test}$ and the bound \ref{exp-bound} for the constructed or trained shallow networks. The plot for TFZL is very similar to the one for ZL (left).
  • Figure 2: This plot shows the resulting test error for a fixed test set and several shallow networks trained with training sets of different sizes $n$.
  • Figure 3: The (average) bound \ref{exp-bound} computed for several different randomly initialized neural networks, with different architectures and for varying number $n$ of training samples. The missing data points for certain architectures indicate that the trained networks could not achieve $\mathcal{E}^{train} < 0.001$.

Theorems & Definitions (8)

  • Definition 1.1
  • Proposition 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Lemma A.1
  • Proof
  • Remark A.2
  • Remark A.3