Metareasoning in uncertain environments: a meta-BAMDP framework

Prakhar Godara; Tilman Diego Aléman; Angela J. Yu

TL;DR

This work addresses metareasoning in environments with unknown reward and transition dynamics by introducing the meta-BAMDP framework, extending Bayes-Adaptive MDPs to incorporate planning uncertainty. It specializes the framework to the $N$-armed Bernoulli bandit task, defining states as $(s,b,\tilde{b})$ with physical and computational actions and a planning-to-action mapping $\mathcal{K}$, and derives tractable approximations through two pruning theorems. The resulting theory yields normative predictions about when agents should invest computational effort, how computation shifts exploration, and how cognitive costs shape behavior in bandit-like decisions, aligning with observed human data in tasks with cognitive load. Overall, the paper offers a resource-rational, testable framework for understanding exploration under computational constraints and provides scalable methods for metareasoning in uncertain environments, with broad implications for AI planning and cognitive modeling.
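To make the base-level object concrete, the following is a minimal sketch of the ordinary (non-meta) Bayes-adaptive problem for a Bernoulli bandit. It is not the paper's algorithm: it omits the planning belief $\tilde{b}$, computational actions, and computational cost entirely. It assumes independent Beta priors over the arm success probabilities, so that the belief $b$ reduces to a tuple of `(alpha, beta)` counts, and an undiscounted finite horizon. The names `update`, `posterior_mean`, and `bayes_optimal_value` are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Assumption: a belief is a tuple of (alpha, beta) pairs, one Beta
# distribution per arm. This representation is illustrative only.


def posterior_mean(arm):
    """Expected success probability of one arm under its Beta belief."""
    alpha, beta = arm
    return alpha / (alpha + beta)


def update(belief, arm_index, reward):
    """Return the belief after pulling `arm_index` and observing reward 0 or 1."""
    alpha, beta = belief[arm_index]
    new_arm = (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)
    return belief[:arm_index] + (new_arm,) + belief[arm_index + 1:]


@lru_cache(maxsize=None)
def bayes_optimal_value(belief, steps_left):
    """Expected total reward of the Bayes-optimal policy over `steps_left` pulls.

    Plain dynamic programming over belief states, with no cost charged
    for the planning itself.
    """
    if steps_left == 0:
        return 0.0
    best = 0.0
    for i, arm in enumerate(belief):
        p = posterior_mean(arm)
        # Average over the two possible outcomes of pulling arm i.
        value = (
            p * (1.0 + bayes_optimal_value(update(belief, i, 1), steps_left - 1))
            + (1.0 - p) * bayes_optimal_value(update(belief, i, 0), steps_left - 1)
        )
        best = max(best, value)
    return best


# Two arms, each with a uniform Beta(1, 1) prior.
uniform_prior = ((1, 1), (1, 1))
value_two_steps = bayes_optimal_value(uniform_prior, 2)
```

The number of reachable belief states grows quickly with the horizon and the number of arms, which is one way to see why planning is costly even before the meta-level question of how much of it to do is posed.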

Abstract

\textit{Reasoning} may be viewed as an algorithm $P$ that chooses an action $a^* \in \mathcal{A}$ with the aim of optimizing some outcome. However, executing $P$ itself incurs costs (time, energy, limited capacity, etc.), which must be weighed alongside the explicit utility obtained from the choice in the underlying decision problem. Finding the right $P$ can itself be framed as an optimization problem over the space of reasoning processes, generally referred to as \textit{metareasoning}. Conventionally, human metareasoning models assume that the agent knows the transition and reward distributions of the underlying MDP. This paper generalizes such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework to handle metareasoning in environments with unknown reward/transition distributions, which encompasses a far larger and more realistic set of planning problems that humans and AI systems face. As a first step, we apply the framework to Bernoulli bandit tasks. Owing to the meta problem's complexity, our solutions are necessarily approximate. However, we introduce two novel theorems that significantly enhance the tractability of the problem, enabling stronger approximations that are robust within a range of assumptions grounded in realistic human decision-making scenarios. These results offer a resource-rational perspective and a normative framework for understanding human exploration under cognitive constraints, and provide experimentally testable predictions about human behavior in Bernoulli bandit tasks.

Paper Structure (25 sections, 6 theorems, 37 equations, 5 figures, 2 algorithms)

This paper contains 25 sections, 6 theorems, 37 equations, 5 figures, 2 algorithms.

Introduction
Related work and contributions
Background
Markov Decision Process - MDP
Bayes-Adaptive Markov Decision Process - BAMDP
Meta-Bayes-Adaptive Markov Decision Process - meta-BAMDP
A meta-BAMDP for $N$ armed Bernoulli bandit task
Finding good approximations via pruning
Implications for human exploration behavior in TABB tasks
Mapping to experimental data
Sensitivity to computational cost manipulations
Conclusions
Appendix
Pseudocode
Robustness of the solution
...and 10 more sections

Key Result

Theorem 1

The optimal meta-policy $\pi^*$ is a mind changer. That is, if for any state $(\bm b, \tilde{b})$, $\pi^*$ prescribes performing computations until $(\bm b, \tilde{b}^\prime)$ and then terminating, then either of the following is true. Here $a_\perp(\cdot)$ denotes the terminal action in state $(\cdot)$, obtained from the subjective value function $\mathcal{K}$ as in Eq. eq:tree_to_action.

Figures (5)

Figure 1: Schematic of a decision action tree for $N=2$ armed bandit task. Solid - current planning-belief $\tilde{b}$, dotted - unexplored subgraph, dashed - a candidate node expansion step.
Figure 2: Behavior of meta-optimal policies. (a) Normalized, total expected reward accrued under the optimal meta-policy for a given computational cost and different task lengths. (b) Average time-step (in the TABB task) at which a node-expansion action is performed, as a function of the computational cost, for tasks of different lengths. (c) Environments in which most computations are performed as a function of computational cost, for different task lengths.
Figure 3: Explaining human adaptation to computational constraints. (a) Coefficient for uncertainty based exploration (also called uncertainty bonus) for a given computational cost and different task lengths, in the environment $(p_1,p_2) = (0.5,0.5)$. (b) Action entropy as a function of computational cost in the environment $(p_1,p_2) = (0.5,0.5)$ for varying task lengths.
Figure 4: (a,b,c) Sensitivity of average time at which exploratory actions are taken, to changes in computational cost, for different task lengths. (d,e,f) Sensitivity of total expected reward to changes in computational cost, for different task lengths.
Figure 5: (a) Normalized reward gained as a function of computational cost for varying number of arms in tasks of length $T=9$. (b) Average number of computations performed as a function of computational cost for varying number of arms in tasks of length $T=9$. (c) Action entropy as a function of computational cost with varying number of arms in tasks of length $T=9$ (averaged over $10^5$ simulation runs). (d) Best-fit $\omega$ as a function of computational cost to behavior generated by meta-optimal policies in a symmetric environment with $p=0.5$ and $T=9$ (averaged over $10^5$ simulation runs).

Theorems & Definitions (13)

Theorem 1
Corollary 1.1
Theorem 2
Corollary 2.1
Definition 1: $\mathcal{M}$-beliefs
Corollary 2.2
proof
proof
proof
proof
...and 3 more
