Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice

Idan Lev-Yehudi; Moran Barenboim; Vadim Indelman

TL;DR

The paper addresses planning in continuous POMDPs with high-dimensional observations by replacing expensive observation models with a cheaper surrogate during planning while providing probabilistic guarantees on performance. The core idea is a state-dependent total variation bound, $\Delta_Z(x)$, that links the true value under $p_Z$ to the value under a simplified model $q_Z$, and an offline/online computation scheme that yields guaranteed bounds without online access to $p_Z$. It introduces a non-parametric local bound via $m_i$ and a cumulative bound $M_t^{\pi}$ (and the action-bound analog $\Phi_t^{\pi}$), together with an online estimator $\tilde{m}_i$ based on pre-sampled delta-states and importance sampling. Theoretical convergence results extend PB-MDP concentration bounds to general policies, and a detailed 2D beacons simulation demonstrates reduced planning time and meaningful policy differences induced by the bounds, implying practical utility for real-time planning with visual observations and potential for runtime pruning and certification.
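To make the state-dependent total variation quantity concrete, here is a minimal Python sketch of a Monte Carlo estimate of $\Delta_Z(x)$ at one state. It relies on the standard identity $\tfrac{1}{2}\int \lvert p_Z - q_Z\rvert\,dz = \tfrac{1}{2}\,\mathbb{E}_{z\sim q_Z}\left[\lvert p_Z/q_Z - 1\rvert\right]$. The 1D Gaussian models, the function names, and the $\tfrac{1}{2}$ normalization are assumptions made for illustration; the paper's own definition and estimator may differ.

```python
import numpy as np

def gaussian_pdf(z, mean, std):
    """Density of a 1D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def estimate_tv(p_pdf, q_pdf, q_sampler, n_samples, rng):
    """Estimate 0.5 * E_{z~q}[|p(z)/q(z) - 1|] from samples of q."""
    z = q_sampler(rng, n_samples)
    ratio = p_pdf(z) / q_pdf(z)
    return 0.5 * np.mean(np.abs(ratio - 1.0))

rng = np.random.default_rng(0)
x = 1.0  # hypothetical state; both toy models are centered on it

# Toy stand-ins: p_Z plays the expensive model, q_Z the simplified one.
p_pdf = lambda z: gaussian_pdf(z, x, 0.5)
q_pdf = lambda z: gaussian_pdf(z, x, 0.8)
q_sampler = lambda rng, n: rng.normal(x, 0.8, size=n)

tv_hat = estimate_tv(p_pdf, q_pdf, q_sampler, 20_000, rng)
tv_same = estimate_tv(q_pdf, q_pdf, q_sampler, 1_000, rng)  # identical models
print(f"estimated TV distance at x={x}: {tv_hat:.3f}")
```

Sampling from the wider model $q_Z$ keeps the density ratio bounded in this toy case. In the setting the summary describes, such estimates would be computed offline, since they require evaluating $p_Z$.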

Abstract

Solving partially observable Markov decision processes (POMDPs) with high-dimensional and continuous observations, such as camera images, is required for many real-life robotics and planning problems. Recent research has suggested machine-learned probabilistic models as observation models, but their use is currently too computationally expensive for online deployment. We address the question of what the implications are of using simplified observation models for planning, while retaining formal guarantees on the quality of the solution. Our main contribution is a novel probabilistic bound based on a statistical total variation distance of the simplified model. We show that it bounds the theoretical POMDP value w.r.t. the original model from the empirical planned value with the simplified model, by generalizing recent results on particle-belief MDP concentration bounds. Our calculations can be separated into offline and online parts, and we arrive at formal guarantees without having to access the costly model at all during planning, which is also a novel result. Finally, we demonstrate in simulation how to integrate the bound into the routine of an existing continuous online POMDP solver.
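The offline/online split can be sketched in Python as follows. The sketch assumes the local bound is the expected discrepancy at the next state, $\mathbb{E}_{x'\sim p_T(\cdot\mid x,a)}[\Delta_Z(x')]$, estimated by importance sampling over states pre-sampled from a uniform proposal and truncated at a fixed distance. The 2D Gaussian transition model, the synthetic `delta_hat` values, and all names are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def gaussian_pdf_2d(points, mean, std):
    """Isotropic 2D Gaussian density evaluated at each row of `points`."""
    sq_dist = np.sum((points - mean) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / std**2) / (2.0 * np.pi * std**2)

# ---- Offline part: pre-sample states and attach discrepancy estimates ----
rng = np.random.default_rng(1)
n_delta = 20_000
lo, hi = 0.0, 10.0
delta_states = rng.uniform(lo, hi, size=(n_delta, 2))  # uniform proposal
proposal_density = 1.0 / (hi - lo) ** 2
# Synthetic stand-in for offline TV estimates: larger toward the bottom (y = 0).
delta_hat = 0.05 + 0.10 * (1.0 - delta_states[:, 1] / hi)

# ---- Online part: only the transition model and stored values are used ----
def local_bound_estimate(x, action, trans_std=0.5, trunc_dist=2.0):
    """Importance-sampling estimate of the expected discrepancy after `action`."""
    mean = x + action  # toy transition: x' ~ N(x + a, trans_std^2 * I)
    dist = np.linalg.norm(delta_states - mean, axis=1)
    keep = dist <= trunc_dist  # drop far-away states with negligible weight
    weights = gaussian_pdf_2d(delta_states[keep], mean, trans_std) / proposal_density
    return np.sum(weights * delta_hat[keep]) / n_delta

x = np.array([5.0, 5.0])
m_top = local_bound_estimate(x, np.array([0.0, 3.0]))      # move up
m_bottom = local_bound_estimate(x, np.array([0.0, -3.0]))  # move down
print(f"local bound moving up: {m_top:.3f}, moving down: {m_bottom:.3f}")
```

In this toy setup, moving toward the region with larger stored discrepancies yields the larger local bound, and the costly model $p_Z$ is never evaluated in the online function.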

Paper Structure (40 sections, 16 theorems, 105 equations, 5 figures, 1 table, 3 algorithms)

Introduction
Contribution
Related Work
Simplification of Probabilistic Models in POMDP Planning
Continuous POMDP Planning With Guarantees
Preliminaries
Methodology
Problem Formulation
Bound for Simplified Observation Model
Equivalence to State-Action Local Bound
Online Estimator of Local Bound
Convergence Guarantees
Implementation
Simulative Setting
Implementation of Bounds
...and 25 more sections

Key Result

Lemma 1

The expected belief-dependent reward with respect to histories is equal to the expected state-dependent reward with respect to the joint distribution of states and observations.
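In symbols, the lemma can be read roughly as below. The notation is introduced here for illustration and is not taken from the paper: $H_t$ is the history at time $t$, $b_t = \mathbb{P}(x_t \mid H_t)$ is the belief, the action is $a_t = \pi_t(H_t)$, and the belief-dependent reward is assumed to be the expected state reward, $r(b_t, a_t) = \mathbb{E}_{x_t \sim b_t}[r(x_t, a_t)]$.

```latex
\mathbb{E}_{H_t}\left[ r(b_t, a_t) \right]
  = \mathbb{E}_{H_t}\left[ \mathbb{E}_{x_t \sim b_t}\left[ r(x_t, a_t) \right] \right]
  = \mathbb{E}_{x_{0:t},\, z_{1:t}}\left[ r(x_t, a_t) \right]
```

Under these assumptions, the first equality is the definition of the belief reward and the second is the law of total expectation.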

Figures (5)

Figure 1: An illustration of a planning session with a simplified observation model. The scattered dots are the pre-sampled states, and the dot size is relative to $\Delta_{Z}$, the estimated discrepancy between the simplified and original observation models. The simplified observation model is less accurate at the bottom, where the surroundings are more visually complex. For the two policies, we compute the bound as a summation over $\Delta_{Z}$ weighted by the transition model. We truncate the summation at a distance indicated by the cyan circles, and the $\Delta_{Z}$ values within it are marked in red. The bottom policy chooses actions that give higher weights to states with greater $\Delta_{Z}$, resulting in looser bounds.
Figure 2: The various relationships between the action value functions in Corollary \ref{crl:JointApproximationBound} for probably-approximately bounding $\lvert Q_{\mathbf{P}}^{p_Z}-\hat{Q}_{\mathbf{M_{P}}}^{q_Z}\rvert\leq \hat{\Phi}_{\mathbf{M_{P}}}$. (A) is given by Theorem \ref{thm:LocalActionBound}, connecting theoretical value functions with the theoretical local state bound. (B) is given by Corollary \ref{crl:ArbitraryPrecisionBounds}, connecting theoretical action value functions with their PB-MDP approximation. (C) is given by any planner with performance guarantees, such as POWSS, approximating the PB-MDP values.
Figure 3: The results of two planning sessions in 2D beacons. The goal is indicated by the blue rectangle, the beacons and their radii by the green squares and circles, and the outer walls by the grey outer rectangle. The filtered delta states $\{x_n^{\Delta}\}_{n=1}^{N_{\Delta}^{\textit{kept}}}$ are indicated by the downward-pointing triangle markers, with color relative to the estimated TV-distance of the simplified observation model, $7.06\leq\hat{\Delta}_{Z} \cdot 10^{2}\leq 12.08$. The colored dots indicate the true state in black, the observation in red, and the particle belief in purple. The empirical mean and covariance of the belief are shown by the grey ellipse. The bars on the right depict $\hat{Q}_t^{q_Z}$ for all actions, with $\hat{\Phi}_t$ as symmetric error bars. The action chosen by the simplified value policy $\pi^{q_Z}_t$ is colored in green, and the action chosen by the lower bound policy $\pi^{\mathcal{LB}}_t$ is colored in blue if it is different. At $t=4$ the two policies disagree: $\pi^{q_Z}_t$ chooses left, whereas $\pi^{\mathcal{LB}}_t$ chooses down.
Figure 4: Mean and standard deviation of planning duration over 100 scenarios vs. scenario time step, with the original observation model $p_Z$ or simplified model $q_Z$.
Figure 5: Percentage of scenarios in each time step in which the lower or upper bound policies, $\pi^{\mathcal{LB}}$ or $\pi^{\mathcal{UB}}$, chose an action different from the simplified value policy $\pi^{q_Z}$.
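The policy comparison in Figures 3 and 5 can be illustrated with a small Python sketch. It assumes that $\pi^{q_Z}$ maximizes $\hat{Q}^{q_Z}$, that $\pi^{\mathcal{LB}}$ maximizes $\hat{Q}^{q_Z}-\hat{\Phi}$, and that $\pi^{\mathcal{UB}}$ maximizes $\hat{Q}^{q_Z}+\hat{\Phi}$. This is how the caption names read, but it is not spelled out in this summary. The action values and bounds are made-up numbers, not results from the paper.

```python
import numpy as np

actions = ["up", "down", "left", "right"]

# Hypothetical planner outputs at a single time step.
q_hat = np.array([1.0, 2.1, 2.3, 0.5])    # estimated action values under q_Z
phi_hat = np.array([0.3, 0.2, 0.6, 0.1])  # estimated bound per action

def select(values):
    """Return the action with the highest score."""
    return actions[int(np.argmax(values))]

pi_q = select(q_hat)             # simplified value policy
pi_lb = select(q_hat - phi_hat)  # lower bound policy (assumed definition)
pi_ub = select(q_hat + phi_hat)  # upper bound policy (assumed definition)

print(pi_q, pi_lb, pi_ub)
```

With these numbers, the action with the highest estimated value also has the loosest bound, so the lower bound policy prefers a different action, mirroring the kind of disagreement the Figure 3 caption describes at $t=4$.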

Theorems & Definitions (24)

Lemma 1
Theorem 1
Corollary 1
Theorem 2
Theorem 3: Generalized PB-MDP Convergence
Corollary 2
Corollary 3
Lemma 1
proof
Theorem 1
...and 14 more
