Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

Qi Heng Ho; Martin S. Feather; Federico Rossi; Zachary N. Sunberg; Morteza Lahijanian

Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

Qi Heng Ho, Martin S. Feather, Federico Rossi, Zachary N. Sunberg, Morteza Lahijanian

TL;DR

This paper tackles the undiscounted Maximal Reachability Probability Problem (MRPP) in POMDPs by extending trial-based belief-search methods to provide two-sided probability bounds. The authors identify fundamental issues with applying discounted-sum methods to MRPP (including incorrect convergence and loop-induced non-termination) and introduce HSVI-RP, a graph-based, trial-driven algorithm that maintains sound lower and upper bounds and uses adaptive depth, UCB-inspired action selection, and exact upper-bound value iteration to handle end components. They prove asymptotic convergence of the lower bound under a finite-memory feasibility assumption and demonstrate through benchmarks that HSVI-RP often yields tighter bounds with competitive computation times compared to state-of-the-art belied-based methods and FSC-based approaches. The work advances practical MRPP verification and policy synthesis for POMDPs, enabling near-optimal guarantees and applicability to temporal logic specifications via product constructs.

Abstract

Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making under transition and observation uncertainties. This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP), where the goal is to maximize the probability of reaching some target states. This is also a core problem in model checking with logical specifications and is naturally undiscounted (discount factor is one). Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP. Specifically, we focus on trial-based heuristic search value iteration techniques and present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space (informed search via value bounds) while addressing their drawbacks in handling loops for indefinite-horizon problems. The algorithm produces policies with two-sided bounds on optimal reachability probabilities. We prove convergence to an optimal policy from below under certain conditions. Experimental evaluations on a suite of benchmarks show that our algorithm outperforms existing methods in almost all cases in both probability guarantees and computation time.

Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

TL;DR

Abstract

Paper Structure (32 sections, 3 theorems, 19 equations, 3 figures, 3 tables, 2 algorithms)

This paper contains 32 sections, 3 theorems, 19 equations, 3 figures, 3 tables, 2 algorithms.

Introduction
Preliminaries and Problem Formulation
From Discounted-Sum to Reachability Probabilities
Incorrect converged solution if $\gamma < 1$ is used.
Trials may not terminate for $\gamma = 1$.
Loops.
Heuristic Graph Search Value Iteration
Lower and Upper Bounds
Value Updates
Periodic Exact Upper Bound Value Iteration
Trial-based Graph Exploration
Pruning
Empirical Evaluation
Q3: Comparison to other approaches.
Conclusion
...and 17 more sections

Key Result

Proposition 1

The optimal value $V^{\pi^\gamma}(b_0)$ computed via Eq. eq:probability expected total reward with discount $\gamma < 1$ strictly under-approximates the optimal probability value $P^{\pi^*}_{\mathcal{M}}(\lozenge \mathrm{T})$ for Problem problem if it takes more than one time step to reach $\mathrm{

Figures (3)

Figure 1: Belief MDP with loops that HSVI2 is ineffective on. The lower and upper bound values are initially $[0, 1]$ for all belief states. $b_4$ is initially not yet explored.
Figure 2: Overview of HSVI-RP. The algorithm incrementally constructs a belief graph using trial-based search, and maintains upper and lower bound values for each belief node. In each trial, actions and observations are selected using a heuristic based on the upper and lower bound values to visit a sequence of belief nodes (orange). After each trial, value bounds for each visited node are updated using local Bellman backups. Every $n$ iterations, upper bound values for all nodes are reset, and frontier nodes (purple) are re-initialized using the upper bound value set $\Upsilon^U$, and value iteration is performed. This allows better improvement of upper bound values for MRPP.
Figure 3: Evolution of lower and upper bound values over time. Overapp computes upper bounds, while STORM, PAYNT, and SAYNT compute lower bounds.

Theorems & Definitions (10)

Definition 1: POMDP
Remark 1
Remark 2
Proposition 1
proof
Remark 3: Observation Heuristic Randomization
Lemma 1: Soundness
Theorem 1: Asymptotic Convergence
proof
proof

Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

TL;DR

Abstract

Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (10)