
Information-Theoretic Generalization Bounds for Deep Neural Networks

Haiyun He, Ziv Goldfeld

TL;DR

This work develops an information-theoretic perspective on generalization for deep networks by linking generalization error to layer-wise distributions of internal representations. It introduces two hierarchical bounds: a KL-divergence bound that tightens with depth and a Wasserstein distance bound that identifies a generalization funnel layer; the bounds are illustrated on a binary Gaussian mixture with linear networks and extended via SDPI to account for stochastic regularization. The SDPI analysis for Dropout, DropConnect, and Gaussian noise yields tightened per-layer contractions, and the Gibbs-algorithm specialization achieves an $O(\frac{1}{n})$ rate under suitable conditions, with deeper, narrower networks shown to generalize better in a finite-parameter setting. Overall, the paper provides a principled, architecture-aware framework linking depth and stochastic regularization to generalization performance in DNNs.

Abstract

Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: $\mathsf{Dropout}$, $\mathsf{DropConnect}$, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this statement applies remains a question.
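To make the contraction idea concrete, the following is a minimal numerical sketch, not the paper's construction. It assumes a toy discrete erasure channel as a stand-in for a stochastic layer: each symbol is kept with probability `1 - delta` and otherwise replaced by an extra "erased" symbol. For this channel the KL divergence between two input distributions contracts by exactly the factor `1 - delta` per layer, so the divergence shrinks geometrically with depth. The names `kl`, `erasure_channel`, and `layerwise_kl`, and the example distributions, are hypothetical and chosen only for illustration; the SDPI coefficients the paper derives for $\mathsf{Dropout}$, $\mathsf{DropConnect}$, and Gaussian noise injection are not reproduced here.

```python
import numpy as np


def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions.

    Assumes q > 0 wherever p > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def erasure_channel(k, delta):
    """Row-stochastic matrix of a toy erasure channel on k symbols.

    Each input symbol is kept with probability 1 - delta and mapped to
    an extra 'erased' symbol (index k) with probability delta.
    """
    K = np.zeros((k, k + 1))
    K[np.arange(k), np.arange(k)] = 1.0 - delta
    K[:, k] = delta
    return K


def layerwise_kl(p, q, delta, num_layers):
    """KL divergence between the two distributions after each toy layer."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    values = [kl(p, q)]
    for _ in range(num_layers):
        K = erasure_channel(len(p), delta)
        p, q = p @ K, q @ K  # push both distributions through the channel
        values.append(kl(p, q))
    return values


# Hypothetical "train" and "test" distributions over three symbols.
p_train = np.array([0.7, 0.2, 0.1])
p_test = np.array([0.3, 0.3, 0.4])

vals = layerwise_kl(p_train, p_test, delta=0.2, num_layers=3)
for layer, v in enumerate(vals):
    print(f"layer {layer}: KL = {v:.6f}")
```

In this toy setting the printed values decrease with the layer index, mirroring the qualitative statement that the KL divergence bound shrinks as one moves deeper into the network. Real layers also apply deterministic maps and nonlinearities, which this sketch deliberately omits.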

Paper Structure

This paper contains 31 sections, 13 theorems, 87 equations, 5 figures, 1 table.

Key Result

Theorem 1

Suppose that the loss function $\ell(\mathbf{w},X,Y)$ is $\sigma$-sub-Gaussian under $P_{X,Y}$, for all $\mathbf{w}\in\mathcal{W}$. We have where

Figures (5)

  • Figure 1: $L$-layer feedforward network.
  • Figure 2: Illustration of binary Gaussian mixture data with $d_0=2$ and the Bayes optimal linear classifier.
  • Figure 3: Examples of DNNs with stochasticity. (a) $l^{\text{th}}$ layer with $\mathsf{Dropout}$ probability $\delta_l$. (b) $l^{\text{th}}$ layer with $\mathsf{DropConnect}$ probabilities $\{\delta_{l,i,j}\}_{j=1}^{d_{l-1}}$. (c) Noisy DNN with isotropic Gaussian noise injected into the $l^{\text{th}}$ layer.
  • Figure 4: Examples of tightened generalization bounds for two types of NNs with stochasticity by adding an extra hidden layer.
  • Figure 5: Examples of tightened generalization bounds for two types of NNs with stochasticity by dividing one hidden layer into two separate hidden layers.

Theorems & Definitions (21)

  • Theorem 1: Hierarchical generalization bound
  • Remark 1: Special cases
  • Theorem 2: Min Wasserstein generalization bound
  • Remark 2: Comparison with KL divergence bound
  • Lemma 3: Prior and posterior of $(X_i,Y_i)$
  • Proposition 4: KL divergence bound evaluation
  • Proposition 5: Wasserstein distance based bound evaluation
  • Example 1: Numerical evaluation of Proposition 5
  • Lemma 6: $\mathsf{Dropout}$ SDPI coefficient
  • Theorem 7: DNN with $\mathsf{Dropout}$ generalization bound
  • ...and 11 more