Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova

TL;DR

The paper proves that asymptotically optimal description length objectives exist for Transformers, building on a new demonstration of their computational universality, and outlines a potential path towards training neural networks that achieve greater compression and generalization.

Abstract

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
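
As a rough illustration of the kind of quantity a variational description length objective penalizes, the sketch below estimates the KL divergence, in bits, from a diagonal Gaussian posterior over weights to a scalar Gaussian mixture prior shared across weights. This is a generic Monte Carlo sketch under stated assumptions, not the paper's construction: the function names (`log_normal`, `log_mixture`, `kl_bits`), the shared-mixture parameterization, and the sampling estimator are all hypothetical choices made for illustration. In a bits-back style objective this KL term would be added to the expected negative log-likelihood of the labels, and an "adaptive" prior would additionally treat the mixture weights, means, and standard deviations as learnable.

```python
import numpy as np


def log_normal(w, mean, std):
    """Elementwise log density of N(mean, std^2) at w (in nats)."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((w - mean) / std) ** 2


def log_mixture(w, weights, means, stds):
    """Log density of a scalar Gaussian mixture, applied to every weight in w.

    weights, means, stds are 1-D arrays of length K (assumed parameterization).
    """
    comp = np.log(weights) + log_normal(w[..., None], means, stds)
    m = comp.max(axis=-1, keepdims=True)  # stable log-sum-exp over components
    return (m + np.log(np.exp(comp - m).sum(axis=-1, keepdims=True)))[..., 0]


def kl_bits(q_mean, q_std, weights, means, stds, n_samples=2000, seed=0):
    """Monte Carlo estimate of KL(q || p) in bits.

    q is a diagonal Gaussian over the weight vector; p applies the same
    mixture independently to each weight.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples,) + q_mean.shape)
    w = q_mean + q_std * eps  # reparameterized samples from q
    log_q = log_normal(w, q_mean, q_std)
    log_p = log_mixture(w, weights, means, stds)
    return float((log_q - log_p).sum(axis=-1).mean() / np.log(2))


# Toy usage: weights clustered near 0 and near 1 are cheap under a prior
# that has mixture components at those two locations.
q_mean = np.array([0.0, 1.0, 1.0, 0.0])
q_std = np.full(4, 0.05)
prior = dict(
    weights=np.array([0.5, 0.5]),
    means=np.array([0.0, 1.0]),
    stds=np.array([0.05, 0.05]),
)
model_bits = kl_bits(q_mean, q_std, **prior)
```

In this toy setup each weight costs roughly one bit (the choice between the two mixture components), which hints at why a mixture prior can reward parameters that collapse onto a small set of shared values.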

Paper Structure

This paper contains 108 sections, 20 theorems, 114 equations, 10 figures, 4 tables.

Key Result

Proposition 1

There exists a universal two-part code.

Figures (10)

  • Figure 1: Two-part code. One way to formalize the MDL principle is with a two-part code. Assume Alice and Bob agree on a two-part code $M$ and are each given inputs $X$. Alice then finds the hypothesis (e.g., model parameters) that enables sending Bob the labels $Y$ in the fewest total bits, balancing the complexity of the model with how well it fits the data. One key challenge we address is that the minimum codelength depends on the potentially arbitrary choice of prior $\alpha_M(h)$.
  • Figure 2: Constructing asymptotically optimal codes for Transformers. We construct a function $\texttt{zmap}$ that establishes that a Transformer (right) can effectively simulate a prefix Turing machine $T$ with any prefix $z$ encoded on a program tape (left), and therefore can represent any computable model function (as defined in Section \ref{sec:problem-setting}) within some arbitrary time and space resource bound $R$. With this mapping between model functions and Transformer parameters established, we can select a prior that assigns probability to sets of parameters based on the algorithmic complexity of the function they compute, thus forming an asymptotically optimal code (Section \ref{sec:two-part-codes-for-transformers}).
  • Figure 3: Upper and lower bounds on minimum codelengths. The figure shows the minimum number of bits required to transmit labels $Y$ given inputs $X$, for different classes of codes. The bounds hold for any dataset ($X,Y$) up to an additive constant that does not depend on the dataset.
  • Figure 4: Transition function definition for standard single-tape Turing machine.
  • Figure 5: ALTA program specification for emulating a single-tape Turing machine.
  • ...and 5 more figures
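
The two-part code described in Figure 1 can be sketched on a toy problem. The snippet below is a minimal illustration, assuming the standard identification of codelength with negative log-probability: the total cost of a hypothesis $h$ is $-\log_2 \alpha_M(h)$ bits for the model plus $-\sum_i \log_2 p(y_i \mid x_i, h)$ bits for the labels. The dataset, the three hypotheses, and their prior probabilities are hypothetical values chosen for illustration and do not come from the paper.

```python
import math


def two_part_codelength(prior_prob, label_probs):
    """Total bits: model cost -log2 prior(h) plus data cost -sum log2 p(y|x,h)."""
    model_bits = -math.log2(prior_prob)
    data_bits = -sum(math.log2(p) for p in label_probs)
    return model_bits + data_bits


# Toy dataset (hypothetical): each label is the XOR of the two input bits.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 1), (1, 0), (0, 0)]
Y = [a ^ b for a, b in X]

# Hypothetical hypotheses: name -> (prior probability, function giving p(y=1 | x)).
# More complex hypotheses get lower prior probability, hence a longer model code.
hypotheses = {
    "constant": (0.5, lambda x: 0.5),
    "first_bit": (0.25, lambda x: 0.9 if x[0] else 0.1),
    "xor": (0.125, lambda x: 0.99 if x[0] ^ x[1] else 0.01),
}


def codelength(name):
    """Two-part codelength of the labels Y given X under the named hypothesis."""
    prior, predict = hypotheses[name]
    probs = [predict(x) if y == 1 else 1.0 - predict(x) for x, y in zip(X, Y)]
    return two_part_codelength(prior, probs)


# Alice picks the hypothesis that minimizes the total number of bits sent to Bob.
best = min(hypotheses, key=codelength)
```

Here the "xor" hypothesis costs more bits to describe than the alternatives but fits the labels so well that its total codelength is smallest. Changing the prior probabilities changes the model cost of each hypothesis, which is the dependence on the choice of prior that the caption highlights.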

Theorems & Definitions (43)

  • Definition 1: two-part code
  • Definition 2: universal two-part code
  • Proposition 1
  • Corollary 1
  • Definition 3: asymptotically optimal families of two-part codes
  • Proposition 2
  • Theorem 1
  • Definition 4: variational code
  • Definition 5: quasi-universal variational code
  • Proposition 3
  • ...and 33 more