Information-Theoretic Framework for Understanding Modern Machine-Learning

Meir Feder, Ruediger Urbanke, Yaniv Fogel

TL;DR

The paper proposes an information-theoretic framework that treats learning as universal probabilistic prediction under the log loss, using a Bayes mixture over model classes to obtain nonuniform regret bounds that depend on architecture-driven complexity. It links the true complexity to spectral properties of the Hessian/Fisher Information Matrix, yielding tractable proxies and explaining phenomena such as flat minima and inductive biases in deep networks and transformers. A key contribution is the complexity measure Comp(P, ε^2), which encapsulates prior mass near the best model and yields bounds that scale with an effective dimension k rather than the full parameter count, thereby reconciling high expressivity with generalization. The authors further show that SGD and its variants act as scalable approximations to the Bayesian mixture, enable practical learning in high-dimensional settings, and provide experiments that illustrate spectral behavior consistent with the theory, offering guidance for designing architectures with broad, advantageous complexity ranges.
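The link between a Bayes mixture and a regret bound under log loss can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical finite class of five Bernoulli models with a uniform prior, and checks the standard fact that the mixture's cumulative log loss exceeds that of the best model in the class by at most the log of the inverse prior weight on that model (here, log 5 nats). The names `mixture_regret`, `thetas`, and `prior` are illustrative.

```python
import math
import random


def mixture_regret(thetas, prior, ys):
    """Cumulative log loss (nats) of the Bayes mixture and of the best
    single model in hindsight, for a binary sequence ys."""
    # Unnormalized log posterior: log prior + log-likelihood so far.
    log_w = [math.log(p) for p in prior]
    mix_loss = 0.0
    model_loss = [0.0] * len(thetas)
    for y in ys:
        # Posterior-weighted predictive probability that y == 1.
        m = max(log_w)
        post = [math.exp(lw - m) for lw in log_w]
        z = sum(post)
        p1 = sum(w * th for w, th in zip(post, thetas)) / z
        mix_loss -= math.log(p1 if y == 1 else 1.0 - p1)
        # Update each model's loss and its posterior weight.
        for i, th in enumerate(thetas):
            lik = th if y == 1 else 1.0 - th
            model_loss[i] -= math.log(lik)
            log_w[i] += math.log(lik)
    return mix_loss, min(model_loss)


random.seed(0)
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical model class
prior = [1.0 / len(thetas)] * len(thetas)   # uniform prior (assumption)
ys = [1 if random.random() < 0.7 else 0 for _ in range(500)]

mix_loss, best_loss = mixture_regret(thetas, prior, ys)
regret = mix_loss - best_loss
bound = math.log(len(thetas))               # log(1 / prior weight)
print(f"regret = {regret:.4f} nats, bound = {bound:.4f} nats")
```

The bound holds for every sequence, not only in expectation, because the mixture probability of the whole sequence is at least the prior weight times the probability assigned by any single model. In a continuous class the single weight is replaced by the prior mass of a neighborhood of the best model, which is the quantity the complexity measure in the summary above is described as capturing.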

Abstract

We introduce an information-theoretic framework that views learning as universal prediction under log loss, characterized through regret bounds. Central to the framework is an effective notion of architecture-based model complexity, defined by the probability mass or volume of models in the vicinity of the data-generating process, or its projection on the model class. This volume is related to spectral properties of the expected Hessian or the Fisher Information Matrix, leading to tractable approximations. We argue that successful architectures possess a broad complexity range, enabling learning in highly over-parameterized model classes. The framework sheds light on the role of inductive biases, the effectiveness of stochastic gradient descent, and phenomena such as flat minima. It unifies online, batch, supervised, and generative settings, and applies across the stochastic-realizable and agnostic regimes. Moreover, it provides insights into the success of modern machine-learning architectures, such as deep neural networks and transformers, suggesting that their broad complexity range naturally arises from their layered structure. These insights open the door to the design of alternative architectures with potentially comparable or even superior performance.
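One way to see how a volume near the best model can depend on the Hessian spectrum is a local quadratic approximation. The sketch below is a heuristic proxy under stated assumptions, not the paper's definition: it assumes the divergence from the best model is approximately a quadratic form with Hessian eigenvalues `eigs`, that the prior is uniform on a ball of radius `radius`, and that a direction only costs complexity when its ellipsoid semi-axis `sqrt(2 * eps2 / lam)` is shorter than `radius`. The function names and the synthetic spectrum are hypothetical.

```python
import math


def effective_dimension(eigs, eps2, radius):
    """Count directions whose curvature is large enough to constrain
    the eps2-neighborhood inside a ball of the given radius."""
    thresh = 2.0 * eps2 / radius ** 2
    return sum(1 for lam in eigs if lam > thresh)


def log_inverse_mass_proxy(eigs, eps2, radius):
    """Heuristic proxy for -log(prior mass of the neighborhood):
    each constrained direction contributes 0.5 * log(lam / thresh);
    flat directions are treated as contributing nothing."""
    thresh = 2.0 * eps2 / radius ** 2
    return 0.5 * sum(math.log(lam / thresh) for lam in eigs if lam > thresh)


# Synthetic spectrum (assumption): a few large eigenvalues that decay
# geometrically, followed by many nearly flat directions.
eigs = [10.0 * 0.5 ** i for i in range(10)] + [1e-8] * 990

k = effective_dimension(eigs, eps2=0.01, radius=1.0)
comp = log_inverse_mass_proxy(eigs, eps2=0.01, radius=1.0)
print(f"parameters = {len(eigs)}, effective dimension k = {k}")
print(f"complexity proxy = {comp:.3f} nats")
```

With this synthetic spectrum only 9 of the 1000 directions exceed the threshold, so the proxy grows with a small effective dimension rather than with the parameter count. Shrinking `eps2` lowers the threshold and brings more directions into the count, which is one informal reading of a complexity that varies over a range of scales.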

Paper Structure

This paper contains 49 sections, 5 theorems, 122 equations, 1 figure.

Key Result

Theorem 1

Assume that $\Theta \subseteq {\mathbb R}^d$, $\|\Theta\|_2 \leq R$, with a uniform prior $w(\theta)$ over $\Theta$. Suppose that for all $\theta \in \Theta$ and all $(x^n,y^n)\in \mathcal{X}^n\times\mathcal{Y}^n$, $P_{\theta}(y^n \,|\, x^n)=\prod_{t=1}^n P_{\theta}(y_t \,|\, x_t)$. Assume that $\hat{\theta}(\mathcal{S})$, the maximum likelihood estimator, satisfies $I(\hat{\theta}(\mathcal{S})) \a

Figures (1)

  • Figure 1: Largest eigenvalues of the empirical average Hessian, ordered from largest to smallest

Theorems & Definitions (11)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Lemma 5: Prior mass of always inactive units
  • ...and 1 more