The Optimal Choice of Hypothesis Is the Weakest, Not the Shortest

Michael Timothy Bennett

TL;DR

This work challenges the view that the optimal hypothesis is the shortest by proving that, under a uniform distribution over tasks, the probability of generalisation is maximised by inferring the weakest valid hypotheses, formalised via $p\left(\mathbf{h} \in M_\omega \;|\; \mathbf{h} \in M_\alpha, \alpha \subset \omega\right) = \frac{2^{|\overline{Z_{S_\alpha}} \cap Z_{\mathbf{h}}|}}{2^{|\overline{Z_{S_\alpha}}|}}$ with the prior $p(\mathbf{h}) \propto 2^{|Z_{\mathbf{h}}|}$. The authors introduce an enactive cognition lattice to define weakness and description length proxies, proving weakness is necessary and sufficient for maximal generalisation while description length is neither. Empirical tests on 8-bit binary addition and multiplication show weakness yields 1.1–5× higher generalisation rates and 1.03–1.56× greater extent than MDL. These results offer a principled explanation for robust generalisation in AI systems (e.g., the Apperception Engine) and motivate exploring vocabulary design and inductive biases in neural architectures to promote weak, generalisable hypotheses.
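The conditional probability quoted above can be evaluated directly for small finite sets. The sketch below is only an illustration, not the paper's code: it treats $Z_{\mathbf{h}}$ and the complement $\overline{Z_{S_\alpha}}$ as plain Python sets, and the function name and toy labels (`s1`, `d1`, ...) are hypothetical placeholders. It shows that, for a fixed complement, a hypothesis whose extension overlaps more of it (a weaker hypothesis) receives a higher value.

```python
from fractions import Fraction


def generalisation_probability(z_h: frozenset, z_s_alpha_complement: frozenset) -> Fraction:
    """Evaluate 2^{|C ∩ Z_h|} / 2^{|C|}, where C stands for the complement of Z_{S_alpha}."""
    overlap = len(z_s_alpha_complement & z_h)
    return Fraction(2 ** overlap, 2 ** len(z_s_alpha_complement))


# Hypothetical toy data: four statements lie outside Z_{S_alpha}.
complement = frozenset({"d1", "d2", "d3", "d4"})

# A weaker hypothesis (larger extension) and a stronger one (smaller extension).
weak_h = frozenset({"s1", "d1", "d2", "d3"})
strong_h = frozenset({"s1", "d1"})

p_weak = generalisation_probability(weak_h, complement)      # 2^3 / 2^4
p_strong = generalisation_probability(strong_h, complement)  # 2^1 / 2^4

print(p_weak, p_strong)
```

Exact fractions are used so that the powers of two stay readable; nothing here depends on how the extensions are actually computed in the paper's formalism.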

Abstract

If $A$ and $B$ are sets such that $A \subset B$, generalisation may be understood as the inference from $A$ of a hypothesis sufficient to construct $B$. One might infer any number of hypotheses from $A$, yet only some of those may generalise to $B$. How can one know which are likely to generalise? One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise (a proxy for intelligence). We examine this in the context of a mathematical formalism of enactive cognition. We show that compression is neither necessary nor sufficient to maximise performance (measured in terms of the probability of a hypothesis generalising). We formulate a proxy unrelated to length or simplicity, called weakness. We show that if tasks are uniformly distributed, then there is no choice of proxy that performs at least as well as weakness maximisation in all tasks while performing strictly better in at least one. In experiments comparing maximum weakness and minimum description length in the context of binary arithmetic, the former generalised at between $1.1$ and $5$ times the rate of the latter. We argue this demonstrates that weakness is a far better proxy, and explains why DeepMind's Apperception Engine is able to generalise effectively.

Paper Structure (13 sections, 3 theorems, 1 equation, 2 tables)

Introduction
Background definitions
Formalising induction
Proofs
Experiments
Setup
Trials
Training phase
Testing phase
Results
Concluding remarks
The Apperception Engine
Neural networks

Key Result

Proposition (sufficiency)

Weakness is a proxy sufficient to maximise the probability that induction generalises from $\alpha$ to $\omega$.

Theorems & Definitions (15)

Definition: environment
Definition: implementable language
Definition: $\mathfrak{v}$-task
Definition: probability
Definition: generalisation
Definition: child and parent
Definition: proxy for intelligence
Definition: induction
Proposition: sufficiency
Proof
...and 5 more
