Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability

Lijia Yu, Yibo Miao, Yifan Zhu, Xiao-Shan Gao, Lijun Zhang

TL;DR

This work studies how neural networks generalize when trained by empirical risk minimization, aiming to explain generalization where classical VC-dimension and Rademacher-complexity bounds fall short. Using the notion of neural network expressive ability, the authors derive a lower bound on population accuracy that depends on how expressible the data distribution is and on the amount of training data, so that generalization follows when data and model size grow independently of each other, including for over-parameterized networks. They also establish a lower bound on the amount of data required (sample complexity), tied to the cost of expressing the distribution, and show that choosing activation functions aligned with the target distribution can drastically reduce data and width requirements. The paper further interprets deep learning phenomena (robust generalization, the benefits of over-parameterization, and the effect of the loss function) within this expressiveness framework, while acknowledging that the analysis is limited to two-layer networks and outlining directions for extending it to deeper models.

Abstract

The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain why over-parameterized models in deep learning generalize well. On the other hand, algorithm-dependent generalization bounds, such as stability bounds, often rely on strict assumptions. To establish generalizability under less stringent assumptions, this paper investigates the generalizability of neural networks that minimize or approximately minimize empirical risk. We establish a lower bound for population accuracy based on the expressiveness of these networks, which indicates that with sufficiently many training samples and a sufficiently large network size, these networks, including over-parameterized ones, can generalize effectively. Additionally, we provide a necessary condition for generalization, demonstrating that, for certain data distributions, the quantity of training data required to ensure generalization exceeds the network size needed to represent the corresponding data distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust generalization, the importance of over-parameterization, and the effect of the loss function on generalization.
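
For orientation, the two quantities the abstract contrasts can be written in the usual way. This is a sketch in standard textbook notation, not necessarily the paper's: the loss $\ell$, the symbols $\widehat{R}$ and $A_{\mathcal{D}}$, and the use of labels in $\{-1,+1\}$ with a sign decision rule are all assumptions made here for illustration.

$$
\widehat{R}_{{\mathcal{D}}_{tr}}({\mathcal{F}}) = \frac{1}{N}\sum_{(x,y)\in{\mathcal{D}}_{tr}} \ell\big({\mathcal{F}}(x),\,y\big),
\qquad
A_{{\mathcal{D}}}({\mathcal{F}}) = \mathbb{P}_{(x,y)\sim{\mathcal{D}}}\big[\operatorname{sign}({\mathcal{F}}(x)) = y\big].
$$

Under this reading, empirical risk minimization selects ${\mathcal{F}}$ by making $\widehat{R}_{{\mathcal{D}}_{tr}}$ small on the $N$ training samples, while generalization concerns how large $A_{{\mathcal{D}}}$ is on the full distribution.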

Paper Structure

This paper contains 32 sections, 36 theorems, 44 equations, 2 figures.

Key Result

Theorem 1.1

Let the data distribution ${\mathcal{D}}$ satisfy the condition that a two-layer network with width $W_0$ can reach accuracy 1 over ${\mathcal{D}}$. Then, with high probability over ${\mathcal{D}}_{tr}\sim{\mathcal{D}}^N$, if $N \ge \Omega(W_0^2)$ and $\mathrm{width}({\mathcal{F}}) \ge \Omega(W_0)$ for ...
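
The hypothesis of the theorem, that some two-layer network of width $W_0$ reaches accuracy 1 over the distribution, can be made concrete with a toy case. The sketch below is an illustration only and does not come from the paper: the one-dimensional distribution, the ReLU activation, the labels in {-1, +1}, and the names `two_layer_net`, `accuracy`, and `W0_PARAMS` are all assumptions. It builds a width-2 network that labels the toy distribution exactly, so $W_0 = 2$ for this example.

```python
import numpy as np

def two_layer_net(X, W, b, a, c):
    """Two-layer ReLU network F(x) = a . relu(W x + b) + c; width = len(a)."""
    hidden = np.maximum(W @ X.T + b[:, None], 0.0)  # shape (width, n_samples)
    return a @ hidden + c                           # shape (n_samples,)

def accuracy(params, X, y):
    """Fraction of samples whose label in {-1, +1} equals sign(F(x))."""
    preds = np.sign(two_layer_net(X, *params))
    return float(np.mean(preds == y))

# Hypothetical toy distribution: x uniform on [-2, 2], label sign(|x| - 1).
# A width-2 network expresses it exactly, since
# relu(x) + relu(-x) - 1 = |x| - 1.
W0_PARAMS = (
    np.array([[1.0], [-1.0]]),  # hidden weights W, shape (2, 1)
    np.zeros(2),                # hidden biases b
    np.ones(2),                 # output weights a
    -1.0,                       # output bias c
)

rng = np.random.default_rng(0)
X_train = rng.uniform(-2.0, 2.0, size=(200, 1))
y_train = np.sign(np.abs(X_train[:, 0]) - 1.0)

train_acc = accuracy(W0_PARAMS, X_train, y_train)
print(f"width-2 network accuracy on the sample: {train_acc}")
```

The theorem then concerns networks found by empirical risk minimization on $N$ such samples, with $N$ and the trained width at least on the orders stated above; the constants hidden in $\Omega(\cdot)$ are not given here, so the sketch does not attempt to check those conditions.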

Figures (2)

  • Figure 1: Accuracy of networks with different widths.
  • Figure 2: Accuracy of networks of width 200, 400, and 600 with different amounts of data.

Theorems & Definitions (75)

  • Theorem 1.1: Informal, see Corollary cor-44
  • Theorem 1.2: Informal, see Section s2
  • Definition 3.1
  • Remark 3.2
  • Proposition 3.3
  • proof
  • Definition 4.1
  • Remark 4.2
  • Proposition 4.3
  • Theorem 4.4
  • ...and 65 more