No Compromise in Solution Quality: Speeding Up Belief-dependent Continuous POMDPs via Adaptive Multilevel Simplification

Andrey Zhitnikov; Ori Sztyglic; Vadim Indelman

TL;DR

This paper presents a complete provable theory of adaptive multilevel simplification for the setting of a given externally constructed belief tree and MCTS that constructs the belief tree on the fly using an exploration technique.

Abstract

Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online. In this paper, we present a complete provable theory of adaptive multilevel simplification for two settings: a given, externally constructed belief tree, and MCTS that constructs the belief tree on the fly using an exploration technique. Our theory makes it possible to accelerate POMDP planning with belief-dependent rewards without any sacrifice in the quality of the obtained solution. We rigorously prove each theoretical claim in the proposed unified theory. Using the general theoretical results, we present three algorithms to accelerate continuous POMDP online planning with belief-dependent rewards. Two of our algorithms, SITH-BSP and LAZY-SITH-BSP, can be utilized on top of any method that constructs a belief tree externally. The third algorithm, SITH-PFT, is an anytime MCTS method that permits plugging in any exploration technique. All our methods are guaranteed to return exactly the same optimal action as their unsimplified equivalents. We replace the costly computation of information-theoretic rewards with novel adaptive upper and lower bounds, which we derive in this paper and which are of independent interest. We show that they are easy to calculate and can be tightened on demand by our algorithms. Our approach is general; namely, any bounds that monotonically converge to the reward can be utilized to achieve significant speedup without any loss in performance. Our theory and algorithms support the challenging setting of continuous states, actions, and observations. The beliefs can be parametric, or general and represented by weighted particles. We demonstrate in simulation a significant speedup in planning compared to baseline approaches, with guaranteed identical performance.
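The core idea in the abstract, choosing an action from bounds that tighten on demand, can be illustrated with a toy sketch. This is not SITH-BSP, LAZY-SITH-BSP, or SITH-PFT, and it does not use the paper's information-theoretic bounds. As an assumption for illustration, the "reward" of each action is simply an average over particle values known to lie in `[lo, hi]`, and a bound at a given level uses only the first `n_used` particles. The names `reward_bounds`, `select_action`, and `levels` are hypothetical.

```python
import numpy as np

def reward_bounds(values, n_used, lo, hi):
    """Bounds on mean(values) computed from only the first n_used entries.

    Unused entries are replaced by their worst case (lo or hi), so the
    interval shrinks as n_used grows and collapses when n_used == len(values).
    """
    n = len(values)
    partial = values[:n_used].sum()
    rest = n - n_used
    return (partial + rest * lo) / n, (partial + rest * hi) / n

def select_action(per_action_values, levels, lo=0.0, hi=1.0):
    """Pick the action with the highest mean reward, tightening bounds lazily.

    levels is an increasing list of particle counts; the last entry must equal
    the full particle count so that the final bounds are exact.
    Returns (best_action, level_index_reached_per_action).
    """
    k = len(per_action_values)
    level = [0] * k
    while True:
        bounds = [reward_bounds(per_action_values[a], levels[level[a]], lo, hi)
                  for a in range(k)]
        lower = np.array([b[0] for b in bounds])
        upper = np.array([b[1] for b in bounds])
        best = int(np.argmax(lower))
        # Rivals are actions whose upper bound still overlaps the best lower bound.
        rivals = [a for a in range(k) if a != best and upper[a] > lower[best]]
        if not rivals:
            return best, level
        # Tighten the loosest interval among the actions involved in the overlap.
        candidates = [a for a in rivals + [best] if level[a] < len(levels) - 1]
        if not candidates:
            return best, level  # all relevant bounds are already exact
        widest = max(candidates, key=lambda a: upper[a] - lower[a])
        level[widest] += 1

# Toy data (assumed): three actions, 1000 particle values each, all in [0, 1].
rng = np.random.default_rng(0)
values = [rng.uniform(0.0, 0.4, 1000),
          rng.uniform(0.8, 1.0, 1000),
          rng.uniform(0.0, 0.4, 1000)]
levels = [250, 500, 750, 1000]
best, used = select_action(values, levels)
exact_best = int(np.argmax([v.mean() for v in values]))
```

Because an action is returned only once its lower bound is at least every other action's upper bound, the choice agrees with the exact comparison (up to ties), while some actions never need their most expensive level.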

Paper Structure

This paper contains 60 sections, 7 theorems, 89 equations, 31 figures, 8 tables, and 9 algorithms.

Introduction
Related Work
Contributions
Paper Organization
Background
POMDPs with Belief-dependent Rewards
Theoretical Objective
Estimated Objective
Objective Estimator in Case of a Given Belief Tree
Interchangeability Between the History and Belief
Coupled Action-Value Function Estimation and Belief Tree Construction
Our Approach
Theoretical Simplification Formulation
Bounds over the Estimated Objective
Multi-Level Simplification
...and 45 more sections

Key Result

Theorem 1

If the bounds over the reward are monotonic (assumption:monotonic) and convergent (assumption:convergence), then for both estimators, eq:SampleQboundsBellmanGivenTree and eq:SampleQboundsBellmanMCTS, the bounds on the sample approximation eq:BoundActionValueSampleGeneral are monotonic [...] Similarly, for the optimal value function, the equality $\underline{\hat{V}}(\cdot) = \hat{V}(\cdot) = \dots$ [...]
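As a reading aid, the two assumptions on the reward bounds can be written out in generic notation. The symbols here are assumed for illustration and may not match the paper's: $s$ indexes the simplification level, $s_{\max}$ is the finest level, $\rho(b)$ is the belief-dependent reward, and $\underline{\rho}^{s}$, $\overline{\rho}^{s}$ are its lower and upper bounds at level $s$. Monotonicity says the bounds can only tighten as the level increases, and convergence says they coincide with the reward at the finest level:

$$
\underline{\rho}^{s}(b) \le \underline{\rho}^{s+1}(b) \le \rho(b) \le \overline{\rho}^{s+1}(b) \le \overline{\rho}^{s}(b),
\qquad
\underline{\rho}^{s_{\max}}(b) = \rho(b) = \overline{\rho}^{s_{\max}}(b).
$$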

Figures (31)

Figure 1: Schematic visualization of the belief tree and the in-place simplification. The superscript in this visualization denotes the index in the belief tree. By $b^s$ we denote the simplified version of the belief $b$.
Figure 2: Reward bounds and different levels of the simplification. Here $n_{\mathrm{max}} = 5$. Warmer colors indicate tighter bounds, whereas colder colors (blue) indicate looser bounds that are cheaper to calculate.
Figure 3:
Figure 4:
Figure 5:
...and 26 more figures

Theorems & Definitions (9)

Definition 1: Resimplification strategy
Theorem 1: Monotonicity and Convergence of Estimated Objective Function Bounds
Definition 2: Tree consistent algorithms
Lemma 1: Validity of the suggested resimplification strategy
Lemma 2: Monotonicity and convergence of UCB bounds
Theorem 2
Theorem 3
Theorem 4: Adaptive bounds on differential entropy estimator
Theorem 5: Monotonicity and convergence
