Table of Contents
Fetching ...

Revelations: A Decidable Class of POMDPs with Omega-Regular Objectives

Marius Belly, Nathanaël Fijalkow, Hugo Gimbert, Florian Horn, Guillermo A. Pérez, Pierre Vandenhove

TL;DR

This work addresses the decidability gap for almost-sure $\\omega$-regular (parity) objectives in POMDPs by introducing a revelation mechanism that enforces eventual full information. It defines weakly revealing and strongly revealing POMDPs and shows that, under these semantics, the problem reduces to analyzing a finite belief-support MDP, enabling exact algorithms. For weakly revealing POMDPs, parity objectives with priorities in $\{0,1,2\}$ are EXPTIME-complete, while $\{1,2,3\}$ is undecidable; strongly revealing POMDPs yield EXPTIME-complete decidability for parity, with two-player CoBüchi games remaining undecidable, and an optimistic (POMDP$_{sr}$) perspective broadens the practical relevance. An implementation sketch and comparisons to DRL highlight the practicality of the belief-support approach, illustrating a precise frontier between decidability and undecidability in POMDPs under revealing semantics.

Abstract

Partially observable Markov decision processes (POMDPs) form a prominent model for uncertainty in sequential decision making. We are interested in constructing algorithms with theoretical guarantees to determine whether the agent has a strategy ensuring a given specification with probability 1. This well-studied problem is known to be undecidable already for very simple omega-regular objectives, because of the difficulty of reasoning on uncertain events. We introduce a revelation mechanism which restricts information loss by requiring that almost surely the agent has eventually full information of the current state. Our main technical results are to construct exact algorithms for two classes of POMDPs called weakly and strongly revealing. Importantly, the decidable cases reduce to the analysis of a finite belief-support Markov decision process. This yields a conceptually simple and exact algorithm for a large class of POMDPs.

Revelations: A Decidable Class of POMDPs with Omega-Regular Objectives

TL;DR

This work addresses the decidability gap for almost-sure -regular (parity) objectives in POMDPs by introducing a revelation mechanism that enforces eventual full information. It defines weakly revealing and strongly revealing POMDPs and shows that, under these semantics, the problem reduces to analyzing a finite belief-support MDP, enabling exact algorithms. For weakly revealing POMDPs, parity objectives with priorities in are EXPTIME-complete, while is undecidable; strongly revealing POMDPs yield EXPTIME-complete decidability for parity, with two-player CoBüchi games remaining undecidable, and an optimistic (POMDP) perspective broadens the practical relevance. An implementation sketch and comparisons to DRL highlight the practicality of the belief-support approach, illustrating a precise frontier between decidability and undecidability in POMDPs under revealing semantics.

Abstract

Partially observable Markov decision processes (POMDPs) form a prominent model for uncertainty in sequential decision making. We are interested in constructing algorithms with theoretical guarantees to determine whether the agent has a strategy ensuring a given specification with probability 1. This well-studied problem is known to be undecidable already for very simple omega-regular objectives, because of the difficulty of reasoning on uncertain events. We introduce a revelation mechanism which restricts information loss by requiring that almost surely the agent has eventually full information of the current state. Our main technical results are to construct exact algorithms for two classes of POMDPs called weakly and strongly revealing. Importantly, the decidable cases reduce to the analysis of a finite belief-support Markov decision process. This yields a conceptually simple and exact algorithm for a large class of POMDPs.

Paper Structure

This paper contains 23 sections, 28 theorems, 7 equations, 10 figures.

Key Result

Proposition 1

Let $\mathcal{P} = (Q, \mathsf{Act}, \mathsf{Sig}, \delta, q_0)$ be a weakly revealing POMDP with priority function $p$, and let $\mathcal{P}_\mathcal{B}$ be its belief-support MDP with priority function $p_\mathcal{B}$. Assume there is an almost-sure strategy $\sigma_\mathcal{B}$ for $\mathsf{Parit

Figures (10)

  • Figure 1: We consider the POMDP on the LHS: there is a single signal $s$, so no information is ever given about the exact state we are in (a behavior the revelation mechanisms forbid!). Yet, almost surely, we reach $q_1$. The priorities indicated on states constitute a parity condition inducing the objective "eventually never visiting $q_1$", which clearly cannot be ensured almost surely. We represent the belief-support MDP on the RHS: the two states are $\{q_0\}$ and $\{q_0, q_1\}$, and only the state $\{q_0, q_1\}$ is visited infinitely often. To assign priorities to the states of this MDP, there are two natural candidates: "maximal priority semantics" and "minimal priority semantics", meaning that we assign either the maximal or minimal priority from the states in the belief support. In this figure, we use the maximal priority semantics: the priority of $\{q_0, q_1\}$ is thus $2$, so the belief-support MDP is winning. This means that the analysis of the belief-support MDP is not sound in general. By tweaking the priorities in this example, one can show that both priority semantics are neither sound nor complete.
  • Figure 2: Summary of our results: decidable subclasses of the parity objective depending on the revelation mechanism.
  • Figure 3: Omega-regular specifications have a natural interpretation in terms of bad events that must all be trumped by future good events. Along a simulation of the POMDP, one can keep track of the number of steps from the last bad event that has not yet been trumped (i.e., lower is better). Here, we depict this value, per step (from $1$ to $500$) over $500$ simulations of a revealing version of the classical tiger POMDP tiger94. A2C, DQN, and PPO are (MlpPolicy) strategies obtained from the stable-baselines library Raffin.Hill.ea:2021, trained (for a total of $10$k time steps) with default parameter values using a simple reward scheme: a good event yields a reward of $100$; a bad one, $-1$. In the simulations, the trained models are queried for deterministic action predictions. The example used will be discussed in Section \ref{['sec:stronglyRevealing']}, Example \ref{['ex:revealingTiger']}.
  • Figure 4: The POMDP $\mathcal{P}$ from \ref{['ex:incompleteCE']} (depicted on the left) with an almost-sure strategy, but whose belief-support MDP (depicted on the right) has no winning strategy. Notation $q, k$ inside a circle depicts a state $q$ with priority $k$. Transitions from states involving a bullet $\bullet$ indicate a probabilistic transition. In POMDPs, we always write the signals along transitions. Actions are omitted when they all induce the same transition from a given state, and probabilities equal to $1$ are omitted.
  • Figure 5: Strongly revealing tiger (Example \ref{['ex:revealingTiger']}).
  • ...and 5 more figures

Theorems & Definitions (58)

  • Remark 1
  • Definition 1: Weakly revealing
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • proof
  • Remark 2
  • Example 1
  • Theorem 2
  • Theorem 3
  • ...and 48 more