
Periodic agent-state based Q-learning for POMDPs

Amit Sinha, Matthieu Geist, Aditya Mahajan

TL;DR

This work rigorously establishes that PASQL (periodic agent-state based Q-learning) converges to a cyclic limit, characterizes the approximation error of the converged periodic policy, and presents a numerical experiment that highlights the salient features of PASQL and demonstrates the benefit of learning periodic policies over stationary ones.

Abstract

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updatable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis, which we illustrate via examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
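
To make the idea of learning a periodic policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a periodic learner keeps one Q-table per phase `t mod period`, updates the table of the current phase, and bootstraps from the table of the next phase. The toy environment `AlternatingEnv`, the agent-state map `last_observation`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


class AlternatingEnv:
    """Hypothetical toy POMDP: a hidden bit flips every step and is never observed."""

    def reset(self):
        self.hidden = 0
        return 0  # there is only one observation

    def step(self, action):
        # Reward +1 for matching the hidden bit, -1 otherwise.
        reward = 1.0 if action == self.hidden else -1.0
        self.hidden = 1 - self.hidden
        return 0, reward


def last_observation(agent_state, observation, action):
    """Assumed agent-state update: keep only the latest observation."""
    return observation


def pasql(env, n_agent_states, n_actions, period,
          update_agent_state=last_observation,
          gamma=0.9, alpha=0.1, epsilon=0.2, n_steps=10_000, seed=0):
    """Sketch of periodic agent-state Q-learning with one table per phase."""
    rng = random.Random(seed)
    # q[phase][agent_state][action]
    q = [[[0.0] * n_actions for _ in range(n_agent_states)]
         for _ in range(period)]
    z = update_agent_state(None, env.reset(), None)
    for t in range(n_steps):
        phase, next_phase = t % period, (t + 1) % period
        # Epsilon-greedy behavior policy with respect to the current phase.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q[phase][z][b])
        obs, r = env.step(a)
        z_next = update_agent_state(z, obs, a)
        # Bootstrap from the Q-table of the next phase (assumed update rule).
        target = r + gamma * max(q[next_phase][z_next])
        q[phase][z][a] += alpha * (target - q[phase][z][a])
        z = z_next
    return q


def greedy_periodic_policy(q):
    """Greedy action for every (phase, agent state) pair."""
    return [[max(range(len(row)), key=row.__getitem__) for row in table]
            for table in q]


q_tables = pasql(AlternatingEnv(), n_agent_states=1, n_actions=2, period=2)
print(greedy_periodic_policy(q_tables))
```

In this toy problem the agent state is constant, so a stationary agent-state policy must pick the same action at every step, while a policy of period 2 can alternate actions with the hidden bit. Under the assumptions above, setting `period=1` leaves a single table, i.e. ordinary agent-state Q-learning.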

Paper Structure (38 sections, 10 theorems, 82 equations, 9 figures, 3 tables)


Key Result

lemma 1

For any behavior policy $\mu$, the process $\{(S_t, Z_t)\}_{t \ge 1}$ is Markov. Therefore, the processes $\{(S_t, Z_t, A_t)\}_{t \ge 1}$ and $\{(S_t, Y_t, Z_t, A_t)\}_{t \ge 1}$ are also Markov.
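
The reasoning behind the first claim can be sketched in one line. The notation here is an assumption, since this page does not define it: $P$ is the joint state-observation kernel of the POMDP, $\phi$ is the recursive agent-state update with $Z_{t+1} = \phi(Z_t, Y_{t+1}, A_t)$, and the behavior policy draws $A_t \sim \mu(\cdot \mid Z_t)$.

```latex
\mathbb{P}\bigl(S_{t+1} = s', Z_{t+1} = z' \bigm| S_{1:t} = s_{1:t},\, Z_{1:t} = z_{1:t}\bigr)
  = \sum_{a} \sum_{y'} \mu(a \mid z_t)\, P(s', y' \mid s_t, a)\,
    \mathbb{1}\{ z' = \phi(z_t, y', a) \}
```

Under these assumptions the right-hand side depends on the history only through $(s_t, z_t)$, which is the Markov property for $\{(S_t, Z_t)\}_{t \ge 1}$. If the behavior policy also depends on time, the same argument would give a time-inhomogeneous chain.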

Figures (9)

  • Figure 1: The cells indicate the state of the environment. Cells with the same background color have the same observation. The cells with a thick red boundary correspond to elements of the set $\mathsf D_0 \coloneqq \{ n(n+1)/2 + 1 : n \in \mathds{N} \}$, where action $0$ gives a reward of $+1$ and moves the state to the right, while action $1$ gives a reward of $-1$ and resets the state to $1$. The cells with a thin black boundary correspond to elements of the set $\mathsf D_1 = \mathds{N} \setminus \mathsf D_0$, where action $1$ gives a reward of $+1$ and moves the state to the right, while action $0$ gives a reward of $-1$ and resets the state to $1$. The discount factor is $\gamma = 0.9$.
  • Figure 2: The model for \ref{ex:PASQL-example}, where states which have the same color give the same observation; the green edges give a reward of $+1$ and blue edges give a reward of $+0.5$.
  • Figure 3: \ref{eq:PASQL} iterates for different behavioral policies (in blue) and the limit predicted by \ref{thm:convergence} (in red).
  • Figure 4: A T-shaped grid world. Agent starts at $\textsc{s}$, where it learns whether the goal state is $\textsc{g}_1$ or $\textsc{g}_2$. It has to go through the corridor $\{1,\dots,2n\}$, without knowing where it is, reach $\textsc{t}$ and go up or down to reach the goal state.
  • Figure 5: \ref{eq:ASQL} iterates for different behavioral policies (in blue) and the limit predicted by \ref{thm:convergence} (in red).
  • ...and 4 more figures
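
The dynamics described in the caption of Figure 1 can be sketched as follows. This is an illustrative reading of the caption, not code from the paper: it assumes that "moves the state to the right" means `state + 1`, and that $\mathds{N}$ may or may not include $0$ (the hypothetical parameter `n_min` covers both readings). The observation map is omitted because the cell colors are not recoverable from the caption. All function names are hypothetical.

```python
import math


def in_d0(state, n_min=0):
    """True if state = n(n+1)/2 + 1 for some integer n >= n_min."""
    k = state - 1
    if k < 0:
        return False
    # k is triangular exactly when 8k + 1 is a perfect square (2n + 1)^2.
    root = math.isqrt(8 * k + 1)
    if root * root != 8 * k + 1:
        return False
    return (root - 1) // 2 >= n_min


def step(state, action, n_min=0):
    """One transition: returns (next_state, reward)."""
    good_action = 0 if in_d0(state, n_min) else 1
    if action == good_action:
        return state + 1, 1.0   # move right with reward +1
    return 1, -1.0              # reset to state 1 with reward -1


def discounted_return(policy, horizon=200, gamma=0.9, n_min=0):
    """Discounted reward of policy(state, t) starting from state 1."""
    state, total, discount = 1, 0.0, 1.0
    for t in range(horizon):
        state, reward = step(state, policy(state, t), n_min)
        total += discount * reward
        discount *= gamma
    return total


def oracle(state, t):
    """State-aware policy that always picks the rewarding action."""
    return 0 if in_d0(state) else 1


print([s for s in range(1, 13) if in_d0(s)])
print(discounted_return(oracle))
```

A policy that sees the true state earns $+1$ at every step, so its discounted return approaches $1/(1-\gamma) = 10$ for $\gamma = 0.9$; a policy restricted to observations cannot in general distinguish the two kinds of cells.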

Theorems & Definitions (17)

  • lemma 1
  • theorem 1
  • theorem 2
  • theorem 3: Strong law of large numbers for Markov chains, Theorem 5.6.1 of Durrett2019
  • proposition 1
  • proof
  • proposition 2
  • proof
  • proposition 3
  • proof
  • ...and 7 more