Information-theoretic analysis of world models in optimal reward maximizers

Alfred Harwood; Jose Faustino; Alex Altair

TL;DR

The paper investigates how much information about the environment is contained in an optimal reward-maximising policy within a finite Controlled Markov Process, where the environment has $n$ states and $m$ actions. It develops a general information-theoretic argument showing that, under a uniform prior over environments, the mutual information $I(X;\Pi)$ between the environment and the optimal deterministic policy equals $n \log m$ bits for any non-constant reward function, across time-discounted, finite-horizon, and time-averaged reward schemes. The approach relies on partitioning the environment space into $m^n$ equal-volume regions corresponding to each deterministic policy, and establishing that the value function is real analytic in the environment; it then uses a zero-set lemma to show that the set of environments where multiple policies are optimal has measure zero. The result provides a precise information-theoretic lower bound on the implicit world representation necessary for optimality, with broad applicability beyond the specific reward aggregation method and without requiring memory or partial observability in the current formulation. This insight informs our understanding of how much internal world-model information a successful agent must encode to achieve optimal behavior and offers a formal target for evaluating internal representations in reinforcement-learning-like systems.
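As a brief sketch of why the two ingredients named above lead to $n \log m$, assume (as the summary states) that the optimal deterministic policy is unique for almost every environment, so that $H(\Pi \mid X) = 0$, and that each of the $m^n$ deterministic policies is optimal on a region of equal volume, so that $P(\Pi = \pi) = m^{-n}$ under the uniform prior. The steps below are the standard entropy identities, not a reproduction of the paper's proof:

$$
I(X;\Pi) = H(\Pi) - H(\Pi \mid X) = H(\Pi) = -\sum_{\pi} m^{-n} \log m^{-n} = \log m^{n} = n \log m .
$$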

Abstract

An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.
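The claim can be checked numerically on a tiny example. The following is a minimal Monte Carlo sketch, not the paper's method. It rests on several assumptions that are not taken from the source: the uniform prior is read as an independent flat Dirichlet draw for every (state, action) transition row, the reward depends only on the arrival state, the objective is discounted with $\gamma = 0.8$, and value iteration is used as the solver. All function names are hypothetical. Because the optimal policy is almost surely a deterministic function of the environment, the empirical entropy of the sampled policies serves as an estimate of $I(X;\Pi)$.

```python
import math
import random
from collections import Counter


def sample_environment(n, m, rng):
    """Sample P[s][a][t] with each transition row uniform on the simplex (assumed prior)."""
    env = []
    for _ in range(n):
        rows = []
        for _ in range(m):
            # Normalised i.i.d. exponentials give a flat Dirichlet(1, ..., 1) draw.
            weights = [rng.expovariate(1.0) for _ in range(n)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        env.append(rows)
    return env


def q_value(env, reward, gamma, v, s, a):
    """Expected reward on arrival plus discounted value of the next state."""
    return sum(p * (reward[t] + gamma * v[t]) for t, p in enumerate(env[s][a]))


def optimal_policy(env, reward, gamma=0.8, sweeps=60):
    """Greedy deterministic policy after value iteration (assumed discounted objective)."""
    n, m = len(env), len(env[0])
    v = [0.0] * n
    for _ in range(sweeps):
        v = [max(q_value(env, reward, gamma, v, s, a) for a in range(m)) for s in range(n)]
    return tuple(
        max(range(m), key=lambda a: q_value(env, reward, gamma, v, s, a))
        for s in range(n)
    )


def estimate_policy_entropy(n, m, reward, samples, seed=0):
    """Plug-in estimate of H(Pi) in bits, plus the raw policy counts."""
    rng = random.Random(seed)
    counts = Counter(
        optimal_policy(sample_environment(n, m, rng), reward) for _ in range(samples)
    )
    entropy = -sum((c / samples) * math.log2(c / samples) for c in counts.values())
    return entropy, counts
```

For example, `estimate_policy_entropy(2, 3, [0.0, 1.0], 2000)` returns an entropy estimate that can be compared with `2 * math.log2(3)`, roughly 3.17 bits. The plug-in estimator is biased slightly downward, so a small shortfall is expected at finite sample sizes.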

Paper Structure (24 sections, 19 theorems, 50 equations)


Introduction
Does achieving goals require a world model?
Related work
Setup
Environments
Policies
Rewards
Mutual Information
Results
Proof Strategy and Notation
Reward Maximisation with discount rate
Reward Maximisation over Finite time horizons
Time-averaged reward maximisation
Conclusion
Acknowledgments
...and 9 more sections

Key Result

Lemma 3.1

Assume a goal represented by the maximization of a value function $V_\pi(x)$. We say that a policy $\pi_o$ is optimal for this value function in environment $x$ if $V_{\pi_o}(x) \geq V_{\pi_i}(x)$ for all $i$. Let [...]. If [...], then [...].
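The optimality condition in Lemma 3.1 can be illustrated by brute force on a small environment: evaluate $V_{\pi_i}(x)$ for every deterministic policy and keep those that attain the maximum. The sketch below is illustrative only and makes assumptions that are not in the source: $V_\pi(x)$ is taken to be the discounted value from a fixed start state, the reward depends on the arrival state, and the names `policy_value` and `optimal_policies` are hypothetical.

```python
from itertools import product


def policy_value(env, policy, reward, gamma=0.9, sweeps=300, start=0):
    """Discounted value of a deterministic policy from `start` (iterative evaluation)."""
    n = len(env)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            sum(p * (reward[t] + gamma * v[t]) for t, p in enumerate(env[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def optimal_policies(env, reward, tol=1e-9, **kwargs):
    """Return every policy pi_o with V_{pi_o}(x) >= V_{pi_i}(x) for all i, up to `tol`."""
    n, m = len(env), len(env[0])
    values = {
        pi: policy_value(env, pi, reward, **kwargs)
        for pi in product(range(m), repeat=n)  # all m**n deterministic policies
    }
    best = max(values.values())
    return [pi for pi, val in values.items() if val >= best - tol], values


# Toy environment x: env[s][a][t] = P(t | s, a); action 1 favours the rewarding state 1.
env = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]
best, values = optimal_policies(env, reward=[0.0, 1.0])
```

In this toy environment a single policy attains the maximum, which matches the generic case described in the summary: environments in which several policies tie form a measure-zero set.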

Theorems & Definitions (44)

Lemma 3.1
proof
Theorem 3.2
proof
Lemma 3.3
proof
Lemma 3.4
proof
Lemma 3.5
proof
...and 34 more
