Agent-state based policies in POMDPs: Beyond belief-state MDPs

Amit Sinha; Aditya Mahajan

TL;DR

This paper presents a unified treatment of several alternatives to the belief-state approach for POMDPs by viewing them as models in which the agent maintains a local, recursively updateable “agent state” and chooses actions based on that agent state.

Abstract

The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local, recursively updateable agent state and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer's approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
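To make the idea of a recursively updateable agent state concrete, here is a minimal sketch, not the authors' construction. It assumes one particular choice of agent state, a sliding window of the last `window` (previous action, observation) pairs; the names `update_agent_state` and `rollout` are hypothetical. The point it illustrates is that the update uses only the previous agent state, the previous action, and the new observation, so it needs no knowledge of the system dynamics.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Obs = Hashable
Act = Hashable
# Assumed agent state: the most recent (previous action, observation) pairs.
AgentState = Tuple[Tuple[Act, Obs], ...]


def update_agent_state(z: AgentState, prev_action: Act, obs: Obs,
                       window: int = 2) -> AgentState:
    """Recursive update: new state depends only on (old state, action, observation)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Append the newest pair and keep only the last `window` entries.
    return (z + ((prev_action, obs),))[-window:]


def rollout(policy: Callable[[AgentState], Act],
            observations: Sequence[Obs],
            window: int = 2) -> Tuple[List[Act], AgentState]:
    """Feed a fixed observation sequence through an agent-state based policy."""
    z: AgentState = ()
    prev_action: Act = None  # no action has been taken before the first observation
    actions: List[Act] = []
    for obs in observations:
        z = update_agent_state(z, prev_action, obs, window)
        prev_action = policy(z)  # the action is a function of the agent state only
        actions.append(prev_action)
    return actions, z
```

Other agent states (for example, the hidden state of a recurrent network) fit the same interface; only the body of the update function changes.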

Paper Structure (15 sections, 2 theorems, 43 equations, 2 figures, 2 tables)

Introduction
The POMDP model and agent-state based policies
System model
Some remarks on the model
Agent-state based policies
Information state
Optimal agent-state based policies
Policy classes
The designer's approach to find the optimal non-stationary policy
Policy search methods to find a locally optimal policy in $\Pi_{\textup{ss}}$
The approximate information state approach to find a good policy in $\Pi_{\textup{sd}}$
Reinforcement learning approaches to learn a policy in $\Pi_{\textup{sd}}$ or $\Pi_{\textup{ss}}$
Agent-state based Q-learning (ASQL)
Agent-state based actor critic (ASAC)
Conclusion

Key Result

proposition 1

The process $\{\xi^{\pi}_t\}_{t \ge 1}$ is a controlled Markov process controlled by $\{\pi_t\}_{t \ge 1}$, i.e.,

Figures (2)

Figure 1: The cells indicate the state of the environment. Cells with the same background color have the same observation. The cells with a thick red boundary correspond to elements of the set $\mathcal{D}_0 \coloneqq \{ n(n+1)/2 + 1 : n \in \mathds{N} \}$, where the action $0$ gives a reward of $+1$ and moves the state to the right, while the action $1$ gives a reward of $-1$ and resets the state to $1$. The cells with a thin black boundary correspond to elements of the set $\mathcal{D}_1 \coloneqq \mathds{N} \setminus \mathcal{D}_0$, where the action $1$ gives a reward of $+1$ and moves the state to the right, while the action $0$ gives a reward of $-1$ and resets the state to $1$.
Figure 2: A POMDP with $\mathcal{S} = \{0,1,2\}$, $\mathcal{A} = \{0, 1\}$ and $\mathcal{Y} = \{0\}$. The reward functions are $r(\cdot,0) = [-1,0,2]$ and $r(\cdot, 1) = -0.5$.
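The state dynamics and rewards described in the caption of Figure 1 can be sketched as follows. This is an illustrative reading of the caption only: the observation map (the cell colors) is not recoverable from the text and is left out, states are assumed to be the integers starting at $1$, and $n$ is assumed to range over $0, 1, 2, \dots$ so that state $1$ belongs to $\mathcal{D}_0$. The function names `in_d0` and `step` are hypothetical.

```python
import math
from typing import Tuple


def in_d0(state: int) -> bool:
    """True if state = n(n+1)/2 + 1 for some integer n >= 0 (assumed range of n)."""
    if state < 1:
        raise ValueError("states are assumed to be integers >= 1")
    # state - 1 is a triangular number iff 8*(state - 1) + 1 is a perfect square.
    m = 8 * (state - 1) + 1
    root = math.isqrt(m)
    return root * root == m


def step(state: int, action: int) -> Tuple[int, int]:
    """Return (next_state, reward) following the caption of Figure 1."""
    if action not in (0, 1):
        raise ValueError("action must be 0 or 1")
    # Action 0 is the rewarded action in D_0; action 1 is the rewarded action in D_1.
    rewarded_action = 0 if in_d0(state) else 1
    if action == rewarded_action:
        return state + 1, 1   # reward +1 and move one cell to the right
    return 1, -1              # reward -1 and reset to state 1
```

Under these assumptions the first few elements of $\mathcal{D}_0$ are $1, 2, 4, 7, 11$, so the gaps between cells where action $0$ is rewarded grow by one each time.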

Theorems & Definitions (4)

proposition 1
definition 1
definition 2
corollary 1
