Information-theoretic analysis of world models in optimal reward maximizers

Alfred Harwood, Jose Faustino, Alex Altair

TL;DR

The paper investigates how much information about the environment is contained in an optimal reward-maximising policy within a finite Controlled Markov Process, where the environment has $n$ states and $m$ actions. It develops a general information-theoretic argument showing that, under a uniform prior over environments, the mutual information $I(X;\Pi)$ between the environment and the optimal deterministic policy equals $n \log m$ bits for any non-constant reward function, across time-discounted, finite-horizon, and time-averaged reward schemes. The approach relies on partitioning the environment space into $m^n$ equal-volume regions corresponding to each deterministic policy, and establishing that the value function is real analytic in the environment; it then uses a zero-set lemma to show that the set of environments where multiple policies are optimal has measure zero. The result provides a precise information-theoretic lower bound on the implicit world representation necessary for optimality, with broad applicability beyond the specific reward aggregation method and without requiring memory or partial observability in the current formulation. This insight informs our understanding of how much internal world-model information a successful agent must encode to achieve optimal behavior and offers a formal target for evaluating internal representations in reinforcement-learning-like systems.
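
The counting behind the $n \log m$ figure can be sketched in one line. This is a reading of the summary above, not the paper's own proof, and it rests on two assumptions taken from that summary: the optimal policy is unique outside a measure-zero set of environments, so $\Pi$ is a deterministic function of $X$ and $H(\Pi \mid X) = 0$; and the $m^n$ policy regions have equal prior probability, so $\Pi$ is uniform over $m^n$ values. With logarithms taken base 2 to give bits:

$$
I(X;\Pi) = H(\Pi) - H(\Pi \mid X) = \log\left(m^n\right) - 0 = n \log m .
$$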

Abstract

An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function then conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.
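
The claim can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper. It assumes a reward that depends on the current state alone, a discounted objective with a hypothetical discount `gamma`, and a uniform prior implemented as an independent Dirichlet(1, ..., 1) draw for each state-action pair. All function names are made up for this example. Because the optimal policy is determined by the environment except on a measure-zero set of ties, the entropy of the sampled optimal policies estimates the mutual information.

```python
import math
import random
from collections import Counter


def sample_environment(n, m, rng):
    """Draw env[s][a][t] = P(next state t | state s, action a).

    Each row is uniform on the probability simplex: normalised i.i.d.
    Exponential(1) draws follow a Dirichlet(1, ..., 1) distribution.
    """
    env = []
    for _ in range(n):
        rows = []
        for _ in range(m):
            weights = [rng.expovariate(1.0) for _ in range(n)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        env.append(rows)
    return env


def optimal_policy(env, reward, gamma=0.9, sweeps=150):
    """Greedy deterministic policy after value iteration.

    Assumption: the reward depends only on the current state, and the
    objective is the infinite-horizon discounted return.
    """
    n, m = len(env), len(env[0])

    def backup(s, a, v):
        # Expected next-state value for action a in state s.
        return sum(p * v[t] for t, p in enumerate(env[s][a]))

    v = [0.0] * n
    for _ in range(sweeps):
        v = [reward[s] + gamma * max(backup(s, a, v) for a in range(m))
             for s in range(n)]
    return tuple(max(range(m), key=lambda a: backup(s, a, v))
                 for s in range(n))


def estimate_policy_entropy(n, m, reward, samples, seed=0):
    """Plug-in entropy (in bits) of the optimal policy over sampled environments."""
    rng = random.Random(seed)
    counts = Counter(
        optimal_policy(sample_environment(n, m, rng), reward)
        for _ in range(samples)
    )
    return -sum(c / samples * math.log2(c / samples) for c in counts.values())
```

For instance, `estimate_policy_entropy(2, 3, [0.0, 1.0], samples=2000)` should come out close to $2 \log_2 3 \approx 3.17$ bits. The plug-in estimator is biased slightly downward at finite sample sizes, so the estimate approaches the target from below as `samples` grows.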

Paper Structure (24 sections, 19 theorems, 50 equations)

Key Result

Lemma 3.1

Assume a goal represented by the maximization of a value function $V_\pi(x)$. We say that a policy $\pi_o$ is optimal for this value function in environment $x$ if $V_{\pi_o}(x) \geq V_{\pi_i}(x)$ for all $i$. Let [...]. If [...]. Then, [...].

Theorems & Definitions (44)

  • Lemma 3.1
  • proof
  • Theorem 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Lemma 3.5
  • proof
  • ...and 34 more