Slicing Mutual Information Generalization Bounds for Neural Networks

Kimia Nadjahi, Kristjan Greenewald, Rickard Brüel Gabrielsson, Justin Solomon

TL;DR

The paper addresses the difficulty of evaluating information-theoretic generalization bounds for neural networks in high dimensions by combining random subspace training with slicing-based dependence measures. It uses disintegrated mutual information and $k$-sliced mutual information to derive tighter, more scalable bounds for models trained on a random $d$-dimensional subspace $\mathrm{W}_{\Theta, d}$; these bounds depend on the subspace dimension $d$ rather than the full parameter dimension $D$. A rate-distortion extension relaxes the requirement that weights lie exactly on the subspace: it adds a distortion term that measures compressibility under slicing and motivates a regularizer that encourages near-compressibility during training. The authors validate the bounds empirically on MNIST, CIFAR-10, and other tasks, obtaining non-vacuous information-theoretic generalization bounds for neural networks and offering guidance on the choice of regularization and subspace dimension.

Abstract

The ability of machine learning (ML) algorithms to generalize well to unseen data has been studied through the lens of information theory, by bounding the generalization error with the input-output mutual information (MI), i.e., the MI between the training data and the learned hypothesis. Yet, these bounds have limited practicality for modern ML applications (e.g., deep learning), due to the difficulty of evaluating MI in high dimensions. Motivated by recent findings on the compressibility of neural networks, we consider algorithms that operate by slicing the parameter space, i.e., trained on random lower-dimensional subspaces. We introduce new, tighter information-theoretic generalization bounds tailored for such algorithms, demonstrating that slicing improves generalization. Our bounds offer significant computational and statistical advantages over standard MI bounds, as they rely on scalable alternative measures of dependence, i.e., disintegrated mutual information and $k$-sliced mutual information. Then, we extend our analysis to algorithms whose parameters do not need to exactly lie on random subspaces, by leveraging rate-distortion theory. This strategy yields generalization bounds that incorporate a distortion term measuring model compressibility under slicing, thereby tightening existing bounds without compromising performance or requiring model compression. Building on this, we propose a regularization scheme enabling practitioners to control generalization through compressibility. Finally, we empirically validate our results and achieve the computation of non-vacuous information-theoretic generalization bounds for neural networks, a task that was previously out of reach.
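To make "trained on random lower-dimensional subspaces" concrete, here is a minimal sketch using least-squares regression rather than a neural network. It assumes the full weights are parameterized as $w = \Theta w'$ with a fixed random $D \times d$ matrix $\Theta$ with orthonormal columns, so that only the $d$ coordinates $w'$ are trained. The function name, the QR-based sampling of $\Theta$, and the optimizer settings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_on_random_subspace(X, y, d, steps=500, lr=0.1, seed=0):
    """Least-squares regression with weights constrained to w = Theta @ w_low.

    Hypothetical helper for illustration; Theta is drawn once and then frozen.
    """
    rng = np.random.default_rng(seed)
    n, D = X.shape
    # Assumption: Theta has orthonormal columns (reduced QR of a Gaussian matrix).
    Theta, _ = np.linalg.qr(rng.standard_normal((D, d)))
    w_low = np.zeros(d)  # the only trainable parameters (d of them, not D)
    for _ in range(steps):
        w = Theta @ w_low                   # lift to the full D-dimensional space
        grad_w = X.T @ (X @ w - y) / n      # gradient of 0.5 * mean squared error
        w_low -= lr * (Theta.T @ grad_w)    # chain rule: gradient w.r.t. w_low
    return Theta, w_low
```

The learned hypothesis is `Theta @ w_low`, which lies in the $d$-dimensional span of `Theta` by construction; this is the setting in which the bounds depend on $d$ rather than $D$.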

Paper Structure (38 sections, 14 theorems, 82 equations, 7 figures)

Key Result

Theorem 3.1

Assume that $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $w \in \mathrm{W}$, where a random variable $X$ is $\sigma$-sub-Gaussian ($\sigma > 0$) under $\mu$ if for all $t \in \mathbb{R}$, $\mathbb{E}_{\mu}[e^{t(X - \mathbb{E}_{\mu}[X])}] \leq e^{\sigma^2 t^2 / 2}$. Then, $|\mathrm{gen}(\mu,
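
The sub-Gaussian condition in this assumption can be checked numerically in a simple case. The sketch below assumes a 0-1 loss, so that for a fixed $w$ the loss is a Bernoulli variable; by Hoeffding's lemma, any variable taking values in $[0, 1]$ is $\tfrac{1}{2}$-sub-Gaussian. The function names are hypothetical, and a check on a finite grid of $t$ values is a sanity test, not a proof.

```python
import numpy as np

def centered_mgf_bernoulli(p, t):
    """E[exp(t * (X - E[X]))] for X ~ Bernoulli(p), computed in closed form."""
    return (1 - p) * np.exp(-t * p) + p * np.exp(t * (1 - p))

def is_sub_gaussian_on_grid(p, sigma, ts):
    """Check E[exp(t (X - E X))] <= exp(sigma^2 t^2 / 2) at every t in ts."""
    lhs = centered_mgf_bernoulli(p, ts)
    rhs = np.exp(sigma ** 2 * ts ** 2 / 2)
    # Small relative tolerance so the equality case t = 0 is not rejected by rounding.
    return bool(np.all(lhs <= rhs * (1 + 1e-12)))
```

Calling `is_sub_gaussian_on_grid(0.3, 0.5, np.linspace(-10, 10, 2001))` tests the inequality with $\sigma = 1/2$. A much smaller $\sigma$ should fail for small $|t|$, because the variance $p(1-p)$ then exceeds $\sigma^2$.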

Figures (7)

  • Figure 1: Gaussian mean estimation: generalization error and bound against $n$, for $D=15$, $d \in \{1, 5, 10, 15\}$. Errors and bounds decrease as $d \to 1$. The bound in bu2019 can only be applied for $d=D$. The scale is log-log.
  • Figure 2: Illustration of our bound (\ref{eq:genbound_boundloss}) and bu2019 on binary classification of Gaussian data of dimension $20$ with logistic regression trained on $\mathrm{W}_{\Theta, d}$.
  • Figure 3: Generalization bounds with NNs for image classification. The weights are projected and quantized.
  • Figure 4: Generalization errors and rate-distortion bounds for feedforward NNs trained on MNIST. Results are averaged over 5 runs. Shaded areas represent the 2.5% and 97.5% percentiles. For each run, expectations are computed with Monte Carlo estimates over 5 samples of $\Theta$.
  • Figure 5: Generalization bounds on MNIST classification with neural networks trained on $\mathrm{W}_{\Theta, d}$.
  • ...and 2 more figures

Theorems & Definitions (26)

  • Theorem 3.1: xu2017
  • Theorem 3.2: bu2019
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 5.1
  • Theorem 5.2
  • Lemma 1.1
  • proof
  • Theorem 1.2
  • proof: Proof of \ref{thm:genbound}
  • ...and 16 more