Posterior Sampling-based Online Learning for Episodic POMDPs

Dengwang Tang, Dongze Ye, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

TL;DR

The paper proposes PS4POMDPs, a posterior sampling-based reinforcement learning algorithm for POMDPs that is much simpler and more implementable than state-of-the-art optimism-based online learning algorithms for POMDPs, and shows that its Bayesian regret scales as the square root of the number of episodes and is polynomial in the other parameters.

Abstract

Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
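
To make the posterior sampling idea concrete, here is a minimal sketch of the generic loop: sample a model from the current posterior, plan for it, run one episode, and update the posterior with the likelihood of the observed trajectory. It is an illustration under simplifying assumptions, not the paper's PS4POMDPs implementation: the prior is supported on a finite list of candidate models, each model is a tuple `(b0, T, Z)` of initial state distribution, transition tensor, and observation matrix, and `plan` and `run_episode` are hypothetical callables supplied by the user.

```python
import numpy as np

def trajectory_likelihood(model, actions, observations):
    """P(o_1..o_H | a_1..a_{H-1}) under one candidate model.

    model = (b0, T, Z) with b0: (S,), T: (A, S, S), Z: (S, O).
    Hidden states are marginalized out by a forward recursion.
    """
    b0, T, Z = model
    alpha = b0 * Z[:, observations[0]]
    for a, o in zip(actions, observations[1:]):
        alpha = (alpha @ T[a]) * Z[:, o]
    return alpha.sum()

def posterior_update(weights, models, actions, observations):
    """Bayes rule over a finite set of candidate models."""
    lik = np.array([trajectory_likelihood(m, actions, observations)
                    for m in models])
    new = weights * lik
    # Assumes at least one candidate assigns positive probability
    # to the observed trajectory, so the normalizer is nonzero.
    return new / new.sum()

def posterior_sampling_loop(models, prior, plan, run_episode,
                            num_episodes, rng):
    """Generic posterior sampling loop (illustrative sketch only).

    plan(model) -> policy and run_episode(policy) -> (actions, observations)
    are hypothetical helpers, not part of any library.
    """
    w = np.asarray(prior, dtype=float).copy()
    for _ in range(num_episodes):
        idx = rng.choice(len(models), p=w)      # sample a model from the posterior
        policy = plan(models[idx])              # act optimally for the sample
        actions, observations = run_episode(policy)
        w = posterior_update(w, models, actions, observations)
    return w
```

In this sketch, exploration comes only from the randomness of the sampled model; no confidence sets or optimistic planning steps are constructed.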

Paper Structure

This paper contains 34 sections, 19 theorems, 120 equations, 8 figures, and 2 algorithms.

Key Result

Theorem 4.1

For general POMDP learning problems, the Bayesian regret under the PS4POMDPs algorithm satisfies

Figures (8)

  • Figure 1: Cumulative expected regret on Tiger with 95% confidence interval over 20 runs. (a) Left: Regret divided by the number of episodes $K$. (b) Right: Divided by $\sqrt{K}$.
  • Figure 2: Cumulative expected regret on RiverSwim with 95% confidence interval over 20 runs. (a) Left: Regret divided by the number of episodes $K$. (b) Right: Divided by $\sqrt{K}$.
  • Figure 3: Proof road map of Theorem \ref{thm:breg0} (dashed) and Theorem \ref{thm:breg1} (solid). For Theorem \ref{thm:breg1}, the bound is developed via an auxiliary quantity called the projected operator distances. In this diagram, $A\rightarrow B$ means "A is upper bounded by some function of B." "TR" stands for "trajectory-based" and "OP" stands for "operator-based."
  • Figure 4: Performance of PS4POMDPs on the Tiger problem. The expected return of a policy is computed exactly by evaluating the policy on the corresponding belief MDP. All results are averaged over 20 independent runs (with randomly generated seeds).
  • Figure 5: Detailed results for the first run of PS4POMDPs in Tiger with $\theta^* = 0.3$.
  • ...and 3 more figures
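
The two normalizations used in Figures 1 and 2 can be read as a quick check of the growth rate: if cumulative regret grows like $\sqrt{K}$, then regret divided by $K$ decays toward zero while regret divided by $\sqrt{K}$ stays bounded. A minimal sketch of that computation follows; the function name and the synthetic input are assumptions for illustration, not the paper's plotting code or data.

```python
import numpy as np

def normalized_regret(cumulative_regret):
    """Return (regret / K, regret / sqrt(K)) for K = 1..len(input).

    cumulative_regret[k-1] is the cumulative regret after k episodes.
    """
    r = np.asarray(cumulative_regret, dtype=float)
    K = np.arange(1, len(r) + 1)
    return r / K, r / np.sqrt(K)

# Synthetic example: regret that grows exactly like 2 * sqrt(K).
K = np.arange(1, 1001)
per_episode, per_sqrt = normalized_regret(2.0 * np.sqrt(K))
```

For this synthetic input, `per_sqrt` is constant and `per_episode` decreases with $K$, which is the qualitative pattern a $\sqrt{K}$ regret bound predicts.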

Theorems & Definitions (33)

  • Remark 3.2
  • Theorem 4.1
  • Remark 4.2
  • Proposition 4.3
  • Theorem 4.4
  • Remark 4.5
  • Corollary 4.6
  • proof
  • Lemma 5.1
  • Lemma 5.2
  • ...and 23 more