Transformers As Approximations of Solomonoff Induction

Nathan Young; Michael Witbrock

TL;DR

The paper investigates whether Transformer models approximate Solomonoff Induction (SolInd) more closely than other sequence predictors do. Solomonoff Induction is a universal Bayesian mixture over all computable distributions, with prior $P_M(x) = \sum_{s \in S} 2^{-l(s)}$, where $S$ is the set of programs whose output begins with $x$ and $l(s)$ is the length of program $s$, and with conditional prediction $P_M(xy|x) = \frac{P_M(xy)}{P_M(x)}$. It is characterized by universality, bounded error, and Pareto optimality. The authors argue that Transformers can be modeled as bounded approximations to SolInd, discuss supporting evidence from universality and decomposition, and acknowledge practical limits arising from stochastic gradient descent and finite memory. They also propose alternate models and a research agenda for analyzing neural networks within this framework, aiming to clarify why Transformers perform well and how these insights might extend to architecture design and explainability.
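The two formulas above can be made concrete with a small sketch. The Python below is a toy illustration only, not the paper's method: in place of a universal machine it assumes a hypothetical "machine" (`toy_output`) that simply repeats a program's bits forever, and it enumerates programs only up to an assumed cutoff `max_len`. Because these toy programs are not prefix-free, the weights are unnormalised; the ratio $P_M(xy)/P_M(x)$ is still a valid conditional probability within the toy.

```python
from fractions import Fraction
from itertools import product


def toy_output(program: str, n: int) -> str:
    """Hypothetical toy machine: repeat the program's bits and return the first n symbols."""
    reps = n // len(program) + 1
    return (program * reps)[:n]


def toy_prior(x: str, max_len: int = 8) -> Fraction:
    """Sum 2^(-l(s)) over all bit-string programs s (length <= max_len) whose output begins with x."""
    total = Fraction(0)
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if toy_output(s, len(x)) == x:
                total += Fraction(1, 2 ** length)
    return total


def toy_conditional(x: str, y: str, max_len: int = 8) -> Fraction:
    """Toy analogue of P_M(xy | x) = P_M(xy) / P_M(x)."""
    return toy_prior(x + y, max_len) / toy_prior(x, max_len)


if __name__ == "__main__":
    x = "010101"
    print("P(next = 0 | x) =", toy_conditional(x, "0"))
    print("P(next = 1 | x) =", toy_conditional(x, "1"))
```

In this toy, the short program `01` dominates the mixture for the input `010101`, so the continuation `0` receives most of the probability mass. This mirrors, in miniature, how the $2^{-l(s)}$ weighting favours short explanations of the observed sequence.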

Abstract

Solomonoff Induction is an unbounded algorithm for sequence prediction that is optimal in the limit: it represents a Bayesian mixture of every computable probability distribution and performs close to optimally in predicting any computable sequence. Because it is an optimal form of computational sequence prediction, it is a plausible reference model against which other methods of sequence prediction can be compared. We put forth and explore the hypothesis that Transformer models, the basis of Large Language Models, approximate Solomonoff Induction better than any other extant sequence prediction method. We examine evidence for and against this hypothesis, give alternate hypotheses that take this evidence into account, and outline next steps for modelling Transformers and other kinds of AI in this way.

Paper Structure (18 sections, 2 equations)

Introduction
Background
Hypothesis
Reasoning
Statement of hypotheses
Significance
Findings in favour
Universality
Decomposition
Findings against
Limits of stochastic gradient descent
Computational limits in practice
Transformers make poor Solomonoff Inductors
Syntheses and alternate hypotheses
Alternate Solomonoff models
...and 3 more sections
