
Transformers As Approximations of Solomonoff Induction

Nathan Young, Michael Witbrock

TL;DR

The paper investigates whether Transformer models approximate Solomonoff Induction (SolInd) more closely than other sequence predictors. Solomonoff Induction is a universal Bayesian mixture over all computable distributions: it assigns a sequence $x$ the prior $P_M(x) = \sum_{s \in S} 2^{-l(s)}$, where $S$ is the set of programs whose output begins with $x$ and $l(s)$ is the length of program $s$, and it predicts a continuation $y$ with $P_M(xy|x) = \frac{P_M(xy)}{P_M(x)}$. It is characterized by universality, bounded error, and Pareto optimality. The authors argue that Transformers can be modeled as bounded approximations to SolInd, discuss supporting evidence from universality and decomposition, and acknowledge practical limits imposed by stochastic gradient descent and finite memory. They also propose alternate models and a research agenda for analyzing neural networks within this framework, aiming to clarify why Transformers perform well and how these insights might extend to architecture design and explainability.
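To make the mixture formula concrete, here is a minimal Python sketch of a heavily bounded, toy analogue of $P_M$. It is not the authors' method and not true Solomonoff Induction. The "programs" are simply all bit strings up to a length cap, and the interpreter `run` (a hypothetical stand-in for a universal machine) just repeats the program's bits forever. The names `toy_programs`, `run`, `mixture_weight`, `predict`, and the parameter `max_len` are all assumptions made for illustration. Because this program set is not prefix-free, the weights do not form a proper semimeasure, but the ratio $P_M(xy)/P_M(x)$ still yields a valid conditional distribution over the next bit.

```python
from itertools import product


def toy_programs(max_len):
    """Yield every bit string of length 1..max_len; each is a hypothetical 'program'."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def run(program, n_out):
    """Toy interpreter (an assumption, not a universal machine):
    output the program's bits repeated forever, truncated to n_out symbols."""
    reps = n_out // len(program) + 1
    return (program * reps)[:n_out]


def mixture_weight(x, max_len=8):
    """Toy analogue of P_M(x): sum 2^-l(s) over programs s whose output begins with x."""
    return sum(
        2.0 ** -len(s)
        for s in toy_programs(max_len)
        if run(s, len(x)) == x
    )


def predict(x, y, max_len=8):
    """Conditional probability of continuation y after x: P_M(xy) / P_M(x)."""
    denom = mixture_weight(x, max_len)
    if denom == 0.0:
        # No program within the length cap reproduces x, so the toy mixture is undefined.
        raise ValueError("no toy program of length <= max_len outputs a string starting with x")
    return mixture_weight(x + y, max_len) / denom


if __name__ == "__main__":
    print(predict("010101", "0"))
    print(predict("010101", "1"))
```

With the default cap of 8, the short program `01` dominates the mixture for the prefix `010101`, so the sketch assigns the continuation `0` a probability of 22/23 (about 0.96) and the continuation `1` the remaining 1/23. Shorter programs receive exponentially more weight, which is the simplicity bias the formula encodes.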

Abstract

Solomonoff Induction is an optimal-in-the-limit unbounded algorithm for sequence prediction, representing a Bayesian mixture of every computable probability distribution and performing close to optimally in predicting any computable sequence. Because it is an optimal form of computational sequence prediction, it is plausible to use it as a model against which other methods of sequence prediction can be compared. We put forth and explore the hypothesis that Transformer models, the basis of Large Language Models, approximate Solomonoff Induction better than any other extant sequence prediction method. We explore evidence for and against this hypothesis, give alternate hypotheses that take this evidence into account, and outline next steps for modelling Transformers and other kinds of AI in this way.
