Minimum Description Length and Generalization Guarantees for Representation Learning

Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

TL;DR

This work develops a Minimum Description Length (MDL)–based compressibility framework to bound generalization in representation learning, challenging the conventional emphasis on mutual information $I(U;X)$. By incorporating block coding and lossy compression, it derives both predicted-label and latent-variable generalization bounds that reflect the encoder’s structure and admit data-dependent priors, connecting to PAC-MDL and f-CMI concepts. The paper provides three families of bounds using Type-I, Type-II, and Type-III symmetric priors and extends the theory to encoder–decoder representations, showing that latent-space MDL governs generalization in two-step learning. Empirical results on CIFAR-10 demonstrate practical gains of data-dependent priors over standard IB-based approaches, supporting the theoretical claims and suggesting new regularization strategies for representation learning.

Abstract

A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While the study of representation learning has spurred much interest, most existing approaches are heuristic, and very little is known about theoretical generalization guarantees. In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length" (MDL) of the labels or the latent variables (representations). Rather than the mutual information between the encoder's input and the representation, which is often believed in the related literature to reflect the algorithm's generalization capability but in fact falls short of doing so, our new bounds involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior. In particular, these new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, which is information-theoretic in nature, builds upon that of Blum-Langford for PAC-MDL bounds and introduces two essential ingredients: block coding and lossy compression. The latter allows our approach to subsume the so-called geometrical compressibility as a special case. To the best knowledge of the authors, the established generalization bounds are the first of their kind for Information Bottleneck (IB) type encoders and representation learning. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior. Numerical simulations illustrate the advantages of well-chosen priors of this kind over the classical priors used in IB.
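
As a point of reference for the "fixed prior" relative-entropy terms mentioned in the abstract, the following minimal sketch computes the relative entropy between a diagonal-Gaussian encoder output and a standard Gaussian prior, which is the regularizer commonly used with classical IB-type (VIB) priors. It is not the paper's multi-letter bound, nor its data-dependent prior: the function names, the diagonal-Gaussian encoder, and the standard normal prior are assumptions made only for illustration.

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Return KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Hypothetical helper: `mu` and `log_var` are per-dimension means and
    log-variances of one latent representation (assumed Gaussian encoder).
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    # Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def average_kl(batch_mu, batch_log_var):
    """Average the per-sample KL term over a batch of representations."""
    if not batch_mu:
        raise ValueError("batch must not be empty")
    total = sum(
        kl_diag_gaussian_to_standard_normal(mu, lv)
        for mu, lv in zip(batch_mu, batch_log_var)
    )
    return total / len(batch_mu)


if __name__ == "__main__":
    # Two toy 2-dimensional representations (made-up numbers).
    mus = [[0.0, 0.0], [1.0, -1.0]]
    log_vars = [[0.0, 0.0], [0.0, 0.0]]
    print(average_kl(mus, log_vars))
```

The term is zero exactly when the encoder output matches the prior, and it grows as the mean drifts or the variance departs from one. This per-sample quantity is computed against a fixed, data-independent prior, whereas the bounds described above involve a relative entropy taken jointly over the training and test representations.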

Paper Structure (45 sections, 11 theorems, 104 equations, 3 figures, 1 table)

Key Result

Theorem 1

Figures (3)

  • Figure 1: Considered learning frameworks.
  • Figure 2: Accuracy during the test phase of our two-step prediction model trained using the standard VIB prior and our "lossless" CDVIB and "lossy" CDVIB priors computed for $M=5$. The values are averaged over 5 runs. The graphs are displayed together with 95% bootstrap confidence intervals.
  • Figure 3: Test and train performances of our two-step prediction model trained using the standard VIB prior, the "lossless" CDVIB prior, and the "lossy" CDVIB prior, both with $M=5$. The plots show the average over 5 runs and 95% bootstrap confidence intervals.

Theorems & Definitions (21)

  • Definition 1: Symmetric Priors
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Theorem 3
  • Theorem 4: Generalization Bound for Representation Learning Algorithms
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8 (Blum and Langford, 2003)
  • ...and 11 more