Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs

Philip Jordan, Maryam Kamgarpour

Abstract

We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
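As a minimal sketch of the model-estimation idea described above (not the paper's Algorithm alg:main), the snippet below forms length-$m$ action-observation windows along a single trajectory and normalizes empirical transition counts into an estimated superstate kernel. The environment interface (`env_reset`, `env_step`), the behavior policy, and the name `estimate_superstate_mdp` are all illustrative assumptions.

```python
from collections import defaultdict

def estimate_superstate_mdp(env_reset, env_step, behavior_policy, m, T):
    """Count-based estimate of the superstate MDP from one trajectory.

    env_reset() starts an episode; env_step(action) returns
    (observation, reward); behavior_policy(window) picks an action
    (e.g., uniformly at random for exploration). A superstate is the
    tuple of the last m (action, observation) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[s][(a, s')]
    reward_sums = defaultdict(lambda: [0.0, 0])     # (sum, n) per (s, a)

    env_reset()   # start the episode; early windows shorter than m are kept as-is
    window = ()   # most recent m (action, observation) pairs
    for _ in range(T):
        action = behavior_policy(window)
        obs, reward = env_step(action)
        next_window = (window + ((action, obs),))[-m:]
        counts[window][(action, next_window)] += 1
        acc = reward_sums[(window, action)]
        acc[0] += reward
        acc[1] += 1
        window = next_window

    # Normalize visit counts into an empirical transition kernel.
    P_hat = {}
    for s, successors in counts.items():
        totals = defaultdict(int)
        for (a, s2), c in successors.items():
            totals[a] += c
        P_hat[s] = {(a, s2): c / totals[a] for (a, s2), c in successors.items()}
    r_hat = {sa: tot / n for sa, (tot, n) in reward_sums.items()}
    return P_hat, r_hat
```

The mismatch highlighted in the abstract shows up here: the trajectory is generated by the POMDP, so consecutive windows are not independent draws from the superstate MDP, which is why the paper's analysis needs concentration results for weakly dependent random variables.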

Paper Structure

This paper contains 15 sections, 5 theorems, 36 equations, 1 figure, and 1 algorithm.

Key Result

Theorem 1

Let Assumptions ass:uni-min and ass:uni-obs hold. For $\epsilon > 0$, $\delta > 0$, and $m \in \mathbb{N}$, suppose we run Algorithm alg:main with [sample-size condition not recovered]. Then, with probability $1-\delta$, the output policy $\pi^m$ satisfies [suboptimality bound not recovered], where $\rho = S\alpha\beta$.
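To illustrate the "combined with value iteration" step from the abstract, here is a minimal sketch that runs value iteration on an estimated superstate MDP (such as `P_hat`, `r_hat` from the sketch above) and extracts a greedy finite-window policy. The discounted objective with factor `gamma` and the handling of unvisited (superstate, action) pairs are illustrative assumptions, not the paper's exact procedure.

```python
def value_iteration(P_hat, r_hat, actions, gamma=0.95, tol=1e-6):
    """Value iteration on the estimated superstate MDP; returns a greedy
    policy mapping a length-m action-observation window to an action."""
    V = {s: 0.0 for s in P_hat}
    while True:
        delta = 0.0
        for s in P_hat:
            q_values = []
            for a in actions:
                succ = [(s2, p) for (a2, s2), p in P_hat[s].items() if a2 == a]
                if not succ:
                    continue  # action never observed in this superstate
                q_values.append(r_hat.get((s, a), 0.0)
                                + gamma * sum(p * V.get(s2, 0.0) for s2, p in succ))
            new_v = max(q_values) if q_values else 0.0
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break

    def policy(s):
        """Greedy action for superstate s; None if s was never visited."""
        best_action, best_q = None, float("-inf")
        for a in actions:
            succ = [(s2, p) for (a2, s2), p in P_hat.get(s, {}).items() if a2 == a]
            if not succ:
                continue
            q = r_hat.get((s, a), 0.0) + gamma * sum(p * V.get(s2, 0.0) for s2, p in succ)
            if q > best_q:
                best_action, best_q = a, q
        return best_action

    return policy, V
```

Executing the returned policy in the POMDP amounts to maintaining the last $m$ action-observation pairs online and querying `policy(window)` at each step.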

Figures (1)

  • Figure 3: Empirical convergence of $V^m(\pi^m)$ towards $V^m_\star$ for $m \in \{1,\dots,5\}$ when running Algorithm alg:main in the Probe environment.

Theorems & Definitions (10)

  • Theorem 1
  • Definition 1
  • Proposition 1
  • Lemma 1
  • Proposition 2
  • Proof
  • Proof of Claim
  • Proof of Theorem thm:main
  • Proof of Proposition prop:uni-min-stab
  • Theorem 2