
Value of Information and Reward Specification in Active Inference and POMDPs

Ran Wei

TL;DR

The paper investigates how expected free energy (EFE), the objective used in active inference, relates to Bayes-optimal reinforcement learning (RL) in POMDPs. By recasting EFE within a belief MDP and decomposing it into pragmatic and epistemic components, it derives a per-step belief-based reward $R^{EFE}(b,a)$ and open-loop belief dynamics, enabling a direct comparison with RL policies. It shows that the epistemic term in EFE acts as a principled information-gain mechanism through which EFE approximates the Bayes-optimal RL policy, and it proves a regret-style bound in which the performance gap to Bayes-optimal RL scales linearly with the horizon and is reduced by information gain. The results offer guidance for objective specification in active inference, notably the need to balance reward and information gain via a temperature parameter, and situate EFE as an approximation to Bayes-optimal behavior in a broad RL/POMDP context.

Abstract

Expected free energy (EFE) is a central quantity in active inference which has recently gained popularity due to its intuitive decomposition of the expected value of control into a pragmatic and an epistemic component. While numerous conjectures have been made to justify EFE as a decision-making objective function, the most widely accepted justification is still its intuitiveness and resemblance to variational free energy in approximate Bayesian inference. In this work, we take a bottom-up approach and ask: taking EFE as given, what is the resulting agent's optimality gap compared with a reward-driven reinforcement learning (RL) agent, which is well understood? By casting EFE under a particular class of belief MDP and using analysis tools from RL theory, we show that EFE approximates the Bayes-optimal RL policy via information value. We discuss the implications for objective specification of active inference agents.
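
To make the pragmatic/epistemic decomposition concrete, the following is a minimal sketch for a small discrete POMDP. It is not the paper's definition of $R^{EFE}(b,a)$: the function name `efe_style_reward`, the array layouts of `T`, `O`, and `R`, and the choice to let `temperature` multiply the epistemic term are all assumptions made for illustration. The pragmatic term is taken to be the expected reward under the belief, and the epistemic term the expected information gain about the next state from the next observation (the mutual information between them).

```python
import numpy as np

def efe_style_reward(b, a, T, O, R, temperature=1.0):
    """Illustrative belief-based reward: expected reward + weighted info gain.

    Assumed (hypothetical) layouts:
      b: (S,)       belief over current states
      T: (A, S, S)  transitions, T[a, s, s_next]
      O: (S, M)     observation likelihoods, O[s_next, o]
      R: (S, A)     state-action rewards
    """
    # Pragmatic term: expected immediate reward under the current belief.
    pragmatic = float(b @ R[:, a])

    # Predicted next-state belief before seeing any observation.
    b_pred = b @ T[a]

    # Joint distribution over (next state, observation) and its marginal.
    joint = b_pred[:, None] * O
    p_obs = joint.sum(axis=0)

    # Epistemic term: mutual information I(s_next; o), which equals the
    # expected KL divergence from the predicted belief to the posterior.
    indep = b_pred[:, None] * p_obs[None, :]
    mask = joint > 0
    epistemic = float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

    return pragmatic + temperature * epistemic, pragmatic, epistemic

# Toy setup: two states, one action that leaves the state unchanged.
T_toy = np.eye(2)[None, :, :]
R_toy = np.array([[1.0], [0.0]])
b_toy = np.array([0.5, 0.5])
O_informative = np.eye(2)             # observation reveals the state
O_uninformative = np.full((2, 2), 0.5)  # observation carries no information
```

In this toy setup the pragmatic term is identical for both observation models, so any difference in the returned total comes from the epistemic term alone; raising or lowering `temperature` changes how strongly that information gain is weighted against reward.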

Paper Structure

This paper contains 26 sections, 20 theorems, 98 equations.

Key Result

Proposition 3.1

(Active inference policy) The EFE achieved by the optimal action sequence can be equivalently achieved by a time-indexed belief-action policy $\pi(a_{t}|b_{t})$.
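
One way to build intuition for this statement, assuming the belief is propagated open-loop (through the transition model only, without conditioning on observations), is that a fixed action sequence then induces a deterministic belief trajectory, so a per-time-step lookup from belief to action can reproduce the sequence. The sketch below illustrates only that intuition and is not the paper's proof; the helper names and the `T[a, s, s_next]` layout are hypothetical.

```python
import numpy as np

def open_loop_beliefs(b0, actions, T):
    """Propagate a belief through T[a] for each action, with no observations."""
    beliefs = [np.asarray(b0, dtype=float)]
    for a in actions:
        beliefs.append(beliefs[-1] @ T[a])
    return beliefs

def belief_key(b):
    """Hashable key for a belief vector (rounded to absorb float noise)."""
    return tuple(np.round(b, 12))

def sequence_to_policy(b0, actions, T):
    """Turn an action sequence into a time-indexed belief -> action lookup."""
    beliefs = open_loop_beliefs(b0, actions, T)
    return [{belief_key(b): a} for b, a in zip(beliefs, actions)]

def rollout_policy(b0, policy, T):
    """Follow the time-indexed policy from b0 under open-loop belief dynamics."""
    b = np.asarray(b0, dtype=float)
    chosen = []
    for table in policy:
        a = table[belief_key(b)]
        chosen.append(a)
        b = b @ T[a]
    return chosen

# Toy model: action 0 keeps the state, action 1 mixes the two states.
T_demo = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.7, 0.3], [0.4, 0.6]],
])
plan = [1, 0, 1]
policy = sequence_to_policy([0.9, 0.1], plan, T_demo)
replayed = rollout_policy([0.9, 0.1], policy, T_demo)
```

Here `replayed` follows the same actions as `plan`, because each belief visited during the rollout is exactly the belief the lookup table was built from.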

Theorems & Definitions (35)

  • Proposition 3.1
  • proof
  • Proposition 3.2
  • Lemma 4.1
  • Lemma 4.2
  • Proposition 5.1
  • Proposition 5.2
  • Theorem 5.5
  • Proposition A.1
  • proof
  • ...and 25 more