Periodic agent-state based Q-learning for POMDPs

Amit Sinha; Matthieu Geist; Aditya Mahajan

TL;DR

This work rigorously establishes that PASQL (periodic agent-state based Q-learning) converges to a cyclic limit, characterizes the approximation error of the converged periodic policy, and presents a numerical experiment that highlights the salient features of PASQL and demonstrates the benefit of learning periodic policies over stationary ones.

Abstract

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is an agent state, which is a model-free, recursively updatable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis, which we illustrate via examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
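The idea of learning a periodic policy on top of an agent state can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the update rule shown (one Q-table per phase `t mod period`, bootstrapping from the next phase's table), the function name `pasql_sketch`, the constant step size, and the toy environment `AlternatingEnv` are all assumptions made for illustration.

```python
import numpy as np


class AlternatingEnv:
    """Hypothetical toy POMDP (not from the paper).

    The hidden state flips 0, 1, 0, 1, ... and the observation is constant,
    so the agent state is always 0. Reward is 1 when the action equals the
    hidden state, and 0 otherwise.
    """

    def reset(self):
        self.s = 0
        return 0  # the only agent state

    def step(self, a):
        r = 1.0 if a == self.s else 0.0
        self.s = 1 - self.s
        return 0, r  # (next agent state, reward)


def pasql_sketch(env, n_agent_states, n_actions, period,
                 gamma=0.9, alpha=0.1, epsilon=1.0, n_steps=5000, seed=0):
    """Tabular periodic Q-learning sketch: q[ell] is used at times t with t % period == ell."""
    rng = np.random.default_rng(seed)
    q = np.zeros((period, n_agent_states, n_actions))
    z = env.reset()
    for t in range(n_steps):
        ell = t % period            # current phase
        nxt = (ell + 1) % period    # phase of the next time step
        # Epsilon-greedy behavior policy (epsilon=1.0 means uniformly random).
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(q[ell, z]))
        z_next, r = env.step(a)
        # Bootstrap from the NEXT phase's table (assumed form of the update).
        target = r + gamma * np.max(q[nxt, z_next])
        q[ell, z, a] += alpha * (target - q[ell, z, a])
        z = z_next
    return q


q = pasql_sketch(AlternatingEnv(), n_agent_states=1, n_actions=2, period=2)
greedy = [int(np.argmax(q[ell, 0])) for ell in range(2)]
print("greedy action per phase:", greedy)
```

In this toy problem a stationary deterministic policy on the single agent state can collect the reward only on every other step, whereas a period-2 policy can match the hidden state at every step. Setting `period=1` reduces the sketch to ordinary stationary agent-state Q-learning.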

Paper Structure (38 sections, 10 theorems, 82 equations, 9 figures, 3 tables)

Introduction
Periodic agent-state based Q-learning (PASQL) with agent state
Model for POMDPs
PASQL: Periodic agent-state based Q-learning algorithm for POMDPs
Establishing the convergence of tabular PASQL
Characterizing the optimality-gap of the converged limit
Numerical experiments
Related work
Discussion
Illustrative examples
\ref{ex:PASQL-example}: Learning curves for \ref{eq:ASQL}
\ref{ex:1}: non-stationary policies can outperform stationary policies
\ref{ex:2}: stochastic policies can outperform deterministic policies
\ref{ex:3}: conceptual difference between state-augmentation and periodic policies
Periodic Markov chains
...and 23 more sections

Key Result

Lemma 1

For any behavior policy $\mu$, the process $\{(S_t, Z_t)\}_{t \ge 1}$ is Markov. Therefore, the processes $\{(S_t, Z_t, A_t)\}_{t \ge 1}$ and $\{(S_t, Y_t, Z_t, A_t)\}_{t \ge 1}$ are also Markov.
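One way to see why the lemma is plausible is to write out the one-step transition. The sketch below relies on notation that this summary does not give and that is assumed here: an agent-state update $Z_{t+1} = \phi(Z_t, Y_{t+1}, A_t)$, a behavior policy that draws $A_t \sim \mu_t(\cdot \mid Z_t)$, and an environment kernel $P(s', y' \mid s, a)$.

```latex
\begin{align*}
\mathbb{P}\bigl(S_{t+1} = s',\, Z_{t+1} = z' \bigm| S_{1:t},\, Z_{1:t}\bigr)
  &= \sum_{a,\, y'} \mu_t(a \mid Z_t)\, P(s', y' \mid S_t, a)\,
     \mathbb{1}\bigl\{ z' = \phi(Z_t, y', a) \bigr\}.
\end{align*}
```

Under these assumptions the right-hand side depends on the past only through $(S_t, Z_t)$, which is the Markov property claimed for $\{(S_t, Z_t)\}_{t \ge 1}$.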

Figures (9)

Figure 1: The cells indicate the state of the environment. Cells with the same background color have the same observation. The cells with a thick red boundary correspond to elements of the set $\mathsf D_0 \coloneqq \{ n(n+1)/2 + 1 : n \in \mathds{N} \}$, where the action $0$ gives a reward of $+1$ and moves the state to the right, while the action $1$ gives a reward of $-1$ and resets the state to $1$. The cells with a thin black boundary correspond to elements of the set $\mathsf D_1 \coloneqq \mathds{N} \setminus \mathsf D_0$, where the action $1$ gives a reward of $+1$ and moves the state to the right, while the action $0$ gives a reward of $-1$ and resets the state to $1$. The discount factor is $\gamma = 0.9$.
Figure 2: The model for \ref{ex:PASQL-example}, where states with the same color give the same observation; the green edges give a reward of $+1$ and the blue edges give a reward of $+0.5$.
Figure 3: \ref{eq:PASQL} iterates for different behavioral policies (in blue) and the limit predicted by \ref{thm:convergence} (in red).
Figure 4: A T-shaped grid world. Agent starts at $\textsc{s}$, where it learns whether the goal state is $\textsc{g}_1$ or $\textsc{g}_2$. It has to go through the corridor $\{1,\dots,2n\}$, without knowing where it is, reach $\textsc{t}$ and go up or down to reach the goal state.
Figure 5: \ref{eq:ASQL} iterates for different behavioral policies (in blue) and the limit predicted by \ref{thm:convergence} (in red).
...and 4 more figures

Theorems & Definitions (17)

Lemma 1
Theorem 1
Theorem 2
Theorem 3: Strong law of large numbers for Markov chains (Theorem 5.6.1 of Durrett, 2019)
Proposition 1
Proof
Proposition 2
Proof
Proposition 3
Proof
...and 7 more
