Recursively-Constrained Partially Observable Markov Decision Processes

Qi Heng Ho, Tyler Becker, Benjamin Kraske, Zakariya Laouar, Martin S. Feather, Federico Rossi, Morteza Lahijanian, Zachary N. Sunberg

TL;DR

Constrained POMDPs (C-POMDPs) can violate the optimal substructure property, leading to unsafe or myopic policies under partial observability. The authors propose Recursively-Constrained POMDPs (RC-POMDPs) with history-dependent cost bounds, prove that deterministic optimal policies exist and that Bellman consistency holds, and develop ARCS, a point-based DP algorithm that produces admissible, near-optimal policies. Theoretical results show that RC-POMDPs admit a contraction Bellman operator on an augmented belief state, which ensures a unique fixed point. Experiments across multiple benchmarks show that RC-POMDP policies avoid stochastic self-destruction and match or exceed C-POMDP performance on key metrics. Overall, RC-POMDPs offer a framework for safe, multi-objective planning under partial observability, with practical DP-based solution methods and empirical support.

Abstract

Many sequential decision problems involve optimizing one objective function while imposing constraints on other objectives. Constrained Partially Observable Markov Decision Processes (C-POMDPs) model this case with transition uncertainty and partial observability. In this work, we first show that C-POMDPs violate the optimal substructure property over successive decision steps and thus may exhibit behaviors that are undesirable for some (e.g., safety-critical) applications. Additionally, online re-planning in C-POMDPs is often ineffective due to the inconsistency resulting from this violation. To address these drawbacks, we introduce the Recursively-Constrained POMDP (RC-POMDP), which imposes additional history-dependent cost constraints on the C-POMDP. We show that, unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies and that optimal policies obey Bellman's principle of optimality. We also present a point-based dynamic programming algorithm for RC-POMDPs. Evaluations on benchmark problems demonstrate the efficacy of our algorithm and show that policies for RC-POMDPs produce more desirable behaviors than policies for C-POMDPs.

Paper Structure

This paper contains 31 sections, 9 theorems, 47 equations, 2 figures, 3 tables, and 5 algorithms.

Key Result

Proposition 1

The RC-POMDP problem can be rewritten in terms of the history-dependent cost bound $d(h_t)$, which is defined recursively.
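
As a rough illustration of what a recursively defined, history-dependent cost bound can look like, the sketch below tracks a remaining cost budget along one history. The recursion itself is not reproduced in this summary, so the update rule used here, $d_{t+1} = (d_t - c_t)/\gamma$ with $d_0$ equal to the initial budget, is an assumption (a common form for discounted cumulative-cost constraints), not necessarily the paper's exact definition. The function names, the non-negative cost assumption, and the numeric values are likewise hypothetical.

```python
from typing import Iterable, List


def cost_bounds_along_history(
    initial_budget: float, step_costs: Iterable[float], gamma: float
) -> List[float]:
    """Return [d_0, d_1, ..., d_T] for one history.

    ASSUMPTION: d_{t+1} = (d_t - c_t) / gamma, where c_t is the expected
    immediate cost incurred at step t. This is an illustrative recursion,
    not one quoted from the paper.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    bounds = [initial_budget]
    for cost in step_costs:
        # Subtract the cost just incurred, then undo one step of discounting
        # so the bound is expressed from the next step's point of view.
        bounds.append((bounds[-1] - cost) / gamma)
    return bounds


def budget_never_exhausted(bounds: Iterable[float]) -> bool:
    """With non-negative costs (assumed), a negative bound cannot be met."""
    return all(d >= 0.0 for d in bounds)


if __name__ == "__main__":
    bounds = cost_bounds_along_history(1.0, [0.5, 0.25], gamma=0.5)
    print(bounds, budget_never_exhausted(bounds))
```

Under this assumed rule, the bound at each step depends on the costs already incurred along the history, which is the sense in which the constraint is history-dependent rather than fixed at the initial belief.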

Figures (2)

  • Figure 1: Counter-example POMDP with associated reward and cost functions. The action at $b_3$ has $0$ reward and cost.
  • Figure 2: Tunnels. There is a cost of $1$ for rock traversal (red regions) and $0.5$ for backtracking. Trajectories from CGCP (blue) and ARCS (green) are displayed, with opacity approximately proportional to frequency of trajectories.

Theorems & Definitions (23)

  • Example 1: Cave Navigation
  • Definition 1: POMDP
  • Definition 2: C-POMDP
  • Remark 1
  • Definition 3: Admissible Policy
  • Remark 2
  • Proposition 1
  • Theorem 1
  • Proposition 2: Belief-Admissible Cost Formulation
  • Theorem 2
  • ...and 13 more