Learning Universal Predictors

Jordi Grau-Moya; Tim Genewein; Marcus Hutter; Laurent Orseau; Grégoire Delétang; Elliot Catt; Anian Ruoss; Li Kevin Wenliang; Christopher Mattern; Matthew Aitchison; Joel Veness

TL;DR

Introduces Solomonoff Induction and its universal prior $M$, highlighting its incomputability and the motivation to approximate it via meta-learning. Frames meta-learning as amortized Solomonoff Induction, showing that neural models trained on diverse, algorithmically generated data can approximate the Bayesian mixture over programs and converge toward the normalized prior $M^{norm}$ under suitable assumptions. Defines computable Solomonoff data generators $M_{s,L,n}$ and the normalized prior $M^{norm}$, proves consistency of empirical estimates, and outlines training with fixed-length sequences to realize convergence to $\hat{M}^{norm}$. Extends to non-uniform sampling with $M_U^Q$ and demonstrates universality under mild conditions, then reports experiments across UTM, VOMS, and Chomsky-hierarchy tasks showing that model scale and universal training data jointly promote increasingly universal predictive capabilities, with patterns that transfer across domains.

Abstract

Meta-learning has emerged as a powerful approach to train neural networks to learn new tasks quickly from limited data. Broad exposure to different tasks leads to versatile representations enabling general problem solving. But what are the limits of meta-learning? In this work, we explore the potential of amortizing the most powerful universal predictor, namely Solomonoff Induction (SI), into neural networks by leveraging meta-learning to its limits. We use Universal Turing Machines (UTMs) to generate training data that exposes networks to a broad range of patterns. We provide theoretical analysis of the UTM data generation processes and meta-training protocols. We conduct comprehensive experiments with neural architectures (e.g., LSTMs, Transformers) and algorithmic data generators of varying complexity and universality. Our results suggest that UTM data is a valuable resource for meta-learning, and that it can be used to train neural networks capable of learning universal prediction strategies.

Paper Structure (41 sections, 8 theorems, 18 equations, 11 figures, 3 tables)

Background
Meta-Learning as an Approximation to Solomonoff Induction
The right dataset: Estimating Solomonoff from Solomonoff Samples
Training Models on Solomonoff Data using Fixed-Sequence Lengths
Solomonoff from Non-Uniform Samples
Experimental Methodology
Results
Discussion and Conclusions
Appendix
Solomonoff samples
Sampling from semimeasures.
Limit normalization.
Training with Transformers
Using Transformers for estimating $M$.
Limit-normalized $\widetilde{M}$.
...and 26 more sections

Key Result

Proposition 3

Let $D:=(x^1,...,x^J)$ be $J$ (in)finite sequences sampled from a semimeasure $\mu$ (e.g. $M$). We can estimate $\mu$ as follows: $\hat{\mu}_D(x) ~:=~ \frac{1}{|D|}\sum_{y\in D}[\![\ell(y)\geq\ell(x)~\wedge~y_{1:\ell(x)}=x]\!] ~\stackrel{w.p.1}\longrightarrow \mu(x) ~~\text{for}~~ |D|\to\infty$.
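
The estimator in Proposition 3 is a prefix count: the fraction of sampled sequences that are at least as long as $x$ and begin with $x$. A minimal Python sketch follows, assuming sequences are represented as strings over a finite alphabet. The function name `empirical_estimate` and the toy samples are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def empirical_estimate(dataset: Sequence[str], x: str) -> float:
    """Fraction of sampled sequences in `dataset` that begin with prefix `x`.

    Hypothetical helper illustrating the counting estimator of Proposition 3.
    """
    if not dataset:
        raise ValueError("dataset must be non-empty")
    # A sample y counts when len(y) >= len(x) and y[:len(x)] == x;
    # str.startswith checks both conditions at once.
    hits = sum(1 for y in dataset if y.startswith(x))
    return hits / len(dataset)


# Toy data: made-up binary samples of varying (finite) length.
samples = ["0110", "01", "1", "0101", "00"]

p_01 = empirical_estimate(samples, "01")    # 3 of 5 samples start with "01"
p_010 = empirical_estimate(samples, "010")  # only "0101"
p_011 = empirical_estimate(samples, "011")  # only "0110"
print(p_01, p_010, p_011)
```

Because the sample `"01"` stops before any continuation, the estimates for the extensions `"010"` and `"011"` sum to less than the estimate for `"01"`. This is the behaviour expected when sampling from a semimeasure rather than a proper measure.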

Figures (11)

Figure 1: Summary of our meta-learning methodology.
Figure 2: Evaluation on VOMS data. Left: Example sequence and highly overlapped predictions of Transformer-L (red) and Bayes-optimal CTW predictor (blue). Lower panels show instantaneous and cumulative regret w.r.t. the ground-truth. Middle: Mean cumulative regret over $6$k sequences (length $256$, max. CTW tree depth $24$, in-distribution) for different networks ($3$ seeds) and sizes (S, M, L). Larger models perform better for all architectures, and the Transformer-L and LSTM-L match the optimal CTW predictor. Right: Length generalization ($1024$ steps). LSTMs generalize to longer lengths, whereas Transformers do not.
Figure 3: Evaluation on $6$k sequences from the Chomsky hierarchy tasks ($400$ per task). As the model size increases, cumulative regret (Left) and accuracy (Middle) improve across all architectures. Overall, the Transformer-L achieves the best performance by a margin. Right: Length generalization ($1024$ steps). Detailed results per task are in Figure \ref{fig:chomsky_results_by_task} in the Appendix.
Figure 4: Evaluation on the UTM data generator with $6$k sequences. Left: The larger the architecture, the lower the cumulative regret. We see better performance than the non-trivial baseline Solomonoff Upper Bound (UB). Middle: The mean accuracy on UTM data shows the models can quickly learn UTM patterns. Right: Length generalization ($1024$ steps). Detailed results per program length are in Figure \ref{fig:results_per_program_length_utm}.
Figure 5: Transfer learning from UTM-trained models on $3$k trajectories. Mean cumulative regret (Left) and accuracy (Middle-Left) of neural models trained on UTM data evaluated against the tasks of the Chomsky hierarchy. We observe a small increase in accuracy (transfer) from the Transformer models. Transfer to CTW is shown in the right two panels: Middle-Right: mean cumulative regret, Right: mean accuracy; 'Naive' is a random uniform predictor.
...and 6 more figures

Theorems & Definitions (14)

Definition 1: Monotonicity
Definition 2: (Monotone) Solomonoff Prior
Definition 3: Normalized Solomonoff Prior
Proposition 3
Definition 4: Computable Solomonoff Prior
Proposition 4
Remark 5
Proposition 5
Theorem 6: Universality of generalized Solomonoff semimeasures
Proposition 6
...and 4 more
