Large Language Models as Computable Approximations to Solomonoff Induction

Jun Wan, Lingrui Mei

TL;DR

The paper connects Large Language Models with Algorithmic Information Theory by proving that LLM training approximates the Solomonoff prior and next-token prediction approximates Solomonoff induction, thereby explaining in-context learning, few-shot adaptation, and scaling laws within a universal-induction framework. It introduces a computable approximation to Solomonoff probability via a program-based encoding of the LLM and demonstrates, through theoretical results and experiments, that inference approximates Solomonoff induction up to a model-dependent scaling factor. A convergence theorem ties Solomonoff prediction to computable target distributions, providing a principled basis for emergent LLM behaviors. Practically, the authors propose a low-confidence few-shot sample selection strategy that improves performance on text classification, especially for smaller models, illustrating the theory’s actionable impact on model development and data-efficient learning.

Abstract

The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
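The selection strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "predictive confidence" means the highest class probability the model assigns to a candidate example, and the names `confidence`, `select_few_shot`, `pool`, and `class_probs` are hypothetical. The probabilities are toy values, not results from the paper.

```python
from typing import List, Sequence, Tuple


def confidence(probs: Sequence[float]) -> float:
    """Predictive confidence, taken here (as an assumption) to be the top class probability."""
    return max(probs)


def select_few_shot(
    pool: List[Tuple[str, str]],
    class_probs: List[Sequence[float]],
    k: int,
    low_confidence: bool = True,
) -> List[Tuple[str, str]]:
    """Pick k (text, label) demonstrations from the pool, ranked by model confidence.

    With low_confidence=True the least confident examples come first;
    with low_confidence=False the most confident ones do (the comparison baseline).
    """
    if len(pool) != len(class_probs):
        raise ValueError("pool and class_probs must have the same length")
    order = sorted(
        range(len(pool)),
        key=lambda i: confidence(class_probs[i]),
        reverse=not low_confidence,
    )
    return [pool[i] for i in order[:k]]


# Toy candidate pool with made-up class probabilities for each example.
pool = [
    ("great movie", "pos"),
    ("it was fine, I guess", "pos"),
    ("terrible", "neg"),
    ("not bad at all", "pos"),
]
class_probs = [[0.97, 0.03], [0.55, 0.45], [0.02, 0.98], [0.40, 0.60]]

demos = select_few_shot(pool, class_probs, k=2)
print(demos)
```

In practice the probabilities would come from the model being prompted, and the selected pairs would be formatted into the few-shot prompt. Other confidence measures, such as the probability of the gold label or the predictive entropy, would fit the same interface.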

Paper Structure

This paper contains 22 sections, 4 theorems, 18 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 2

Let $\overline{f}(x,s)$ be a program constructed according to Definition 1, and define the approximate Solomonoff prior where $\ell({\overline{f}(x, s)})$ denotes the length of the program describing $\overline{f}(x,s)$. Then:

Figures (1)

  • Figure 1: Conceptual diagram of our theoretical framework linking LLM processes to AIT. LLM Process (Top): Training optimizes parameters ($\theta'$) via loss minimization $\mathcal{L}(\theta;\mathcal{D})$, while inference uses $\theta$ to predict $x_{t+1}$ from $x_{1:t}$ via $P_{\theta}(x_{t+1}|x_{1:t})$. AIT (Bottom): Kolmogorov Complexity $K(x)$ informs the Solomonoff prior (approximated as $\overline{M}(x)$), which underlies Solomonoff induction $M(x_{t+1}|x_{1:t})$. Crucially, LLM training is shown to approximate the Solomonoff prior, and LLM inference's predictive distribution $P_{\theta}$ approximates Solomonoff induction (via $\overline{M}(x_{t+1}|x_{1:t})$), bridging LLM operations with AIT.
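For reference, the quantities $M(x)$ and $M(x_{t+1}|x_{1:t})$ named in the caption have standard textbook definitions in AIT, reproduced below. These are the general definitions, not the paper's computable approximation $\overline{M}$. Here $U$ is assumed to be a universal monotone Turing machine, $\ell(p)$ is the length of program $p$ in bits, and $x*$ denotes any output that begins with $x$.

```latex
% Solomonoff prior: total weight of all (minimal) programs whose output starts with x
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

% Solomonoff induction: predict the next symbol by conditioning the prior
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t+1})}{M(x_{1:t})}
```

Shorter programs receive exponentially more weight, which is why the abstract can read loss minimization as program length optimization.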

Theorems & Definitions (8)

  • Definition 1: Language Model Generation Function
  • Theorem 2: LLM Training Approximates Solomonoff Prior
  • Theorem 3: LLM Inference Approximates Solomonoff Induction
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof