Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives
Qi Heng Ho, Martin S. Feather, Federico Rossi, Zachary N. Sunberg, Morteza Lahijanian
TL;DR
This paper tackles the undiscounted Maximal Reachability Probability Problem (MRPP) in POMDPs by extending trial-based belief-search methods to provide two-sided probability bounds. The authors identify fundamental issues with applying discounted-sum methods to MRPP (including incorrect convergence and loop-induced non-termination) and introduce HSVI-RP, a graph-based, trial-driven algorithm that maintains sound lower and upper bounds. To handle end components, HSVI-RP combines adaptive search depth, UCB-inspired action selection, and exact upper-bound value iteration. They prove asymptotic convergence of the lower bound under a finite-memory feasibility assumption and demonstrate through benchmarks that HSVI-RP often yields tighter bounds, with competitive computation times, than state-of-the-art belief-based and FSC-based approaches. The work advances practical MRPP verification and policy synthesis for POMDPs, enabling near-optimal guarantees and applicability to temporal logic specifications via product constructions.
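To make the loop issue concrete, here is a minimal Python sketch on a fully observable toy MDP (the simplest special case of a POMDP). It is not taken from the paper and is not HSVI-RP: the model `TOY_MDP`, the state and action names, and the function `value_iteration` are all hypothetical, chosen only for illustration. State `s0` has a self-looping action `stay` and an action `go` that reaches the target with probability 0.5, so the true maximal reachability probability is 0.5.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: transitions[state][action] = [(probability, next_state), ...]
Transitions = Dict[str, Dict[str, List[Tuple[float, str]]]]

TOY_MDP: Transitions = {
    "s0": {
        "stay": [(1.0, "s0")],                  # self-loop: a trivial end component
        "go": [(0.5, "goal"), (0.5, "sink")],   # reaches the target half the time
    },
}


def value_iteration(mdp: Transitions, init: float,
                    discount: float = 1.0, sweeps: int = 1000) -> float:
    """Run Bellman backups from the initial guess `init` for s0; return the value of s0."""
    # The target has value 1 and the sink has value 0; both are absorbing and stay fixed.
    values = {"s0": init, "goal": 1.0, "sink": 0.0}
    for _ in range(sweeps):
        for state, actions in mdp.items():
            values[state] = max(
                discount * sum(p * values[nxt] for p, nxt in outcomes)
                for outcomes in actions.values()
            )
    return values["s0"]


lower = value_iteration(TOY_MDP, init=0.0)                        # from below
upper = value_iteration(TOY_MDP, init=1.0)                        # from above
discounted = value_iteration(TOY_MDP, init=1.0, discount=0.95)    # discounted surrogate

print(f"undiscounted from below: {lower}")
print(f"undiscounted from above: {upper}")
print(f"discounted (0.95) from above: {discounted:.3f}")
```

In this toy model, iterating from below settles on 0.5, while the undiscounted iteration from above never leaves 1.0 because the `stay` loop keeps reproducing the optimistic value. Adding a discount factor makes the upper iteration converge, but to roughly 0.475, which is not the reachability probability. This is one simple way such failures can arise; the paper's own analysis of discounted-sum methods may rest on different examples.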
Abstract
Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making under transition and observation uncertainties. This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP), where the goal is to maximize the probability of reaching some target states. This is also a core problem in model checking with logical specifications and is naturally undiscounted (discount factor is one). Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP. Specifically, we focus on trial-based heuristic search value iteration techniques and present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space (informed search via value bounds) while addressing their drawbacks in handling loops for indefinite-horizon problems. The algorithm produces policies with two-sided bounds on optimal reachability probabilities. We prove convergence to an optimal policy from below under certain conditions. Experimental evaluations on a suite of benchmarks show that our algorithm outperforms existing methods in almost all cases in both probability guarantees and computation time.
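For reference, the objective described in the abstract can be written compactly as follows. The notation is an assumption introduced here for illustration, not necessarily the paper's: $b_0$ is the initial belief, $T$ the set of target states, $\pi$ an observation-based policy, $s_t$ the hidden state at step $t$, and $\gamma$, $r$ the discount factor and reward of a standard discounted problem.

```latex
% MRPP: maximize the probability of ever reaching the target set T
P^{*}(b_0) \;=\; \sup_{\pi} \; \Pr\nolimits^{\pi}_{b_0}\!\left( \exists\, t \ge 0 : s_t \in T \right)

% For contrast, the usual discounted objective, with \gamma \in [0, 1)
V^{*}(b_0) \;=\; \sup_{\pi} \; \mathbb{E}^{\pi}_{b_0}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
```

The first objective places no bound on the number of steps and applies no discounting, which is why the abstract describes the problem as indefinite-horizon with a discount factor of one.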
