Belief-State Query Policies for User-Aligned POMDPs

Daniel Bramblett; Siddharth Srivastava

TL;DR

The paper introduces belief-state query (BSQ) policies for expressing user-aligned preferences in goal-oriented POMDPs (gPOMDPs) and formally analyzes their properties. It proves that the expected cost $E_\pi(\overline{\vartheta};H)$ of parameterized BSQ policies is piecewise constant and non-convex, with the parameter space partitioned into a finite set of braids corresponding to leaves of strategy trees. It proposes a novel Partition Refinement Search (PRS) algorithm that is probabilistically complete: by refining parameter partitions along braid boundaries, it converges to the optimal user-aligned policy in the limit. Empirical results on Lane Merger, Spaceship Repair, Graph Rock Sample, and Store Visit show PRS outperforming baselines and existing solvers in producing policies that align with user requirements, while remaining computationally feasible. The work enables user-driven constraint specification in partially observable settings without reward shaping, highlighting both practical impact and avenues for future extensions.
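The idea of refining a parameter partition along the boundaries where behavior changes can be illustrated with a one-dimensional toy. The sketch below is not the paper's PRS algorithm. It assumes a hypothetical rule "repair iff belief > theta", a hand-written list of reachable beliefs, and made-up costs; the names `REACHABLE`, `region_around`, and `refine` are illustrative only.

```python
import random

# Hypothetical toy model: (probability of reaching the belief, belief value).
REACHABLE = [(0.5, 0.75), (0.5, 0.25)]
REPAIR_COST, FAILURE_COST = 2.0, 5.0  # made-up costs

def expected_cost(theta):
    """Expected cost of the rule 'repair iff belief > theta' on the toy model."""
    return sum(p * (REPAIR_COST if b > theta else b * FAILURE_COST)
               for p, b in REACHABLE)

def region_around(theta, lo, hi):
    """Largest sub-interval [a, b) of [lo, hi) where the rule acts as it does at theta."""
    cuts = sorted(b for _, b in REACHABLE)
    a = max([c for c in cuts if c <= theta] + [lo])
    b = min([c for c in cuts if c > theta] + [hi])
    return a, b

def refine(lo=0.0, hi=1.0, seed=0):
    """Split [lo, hi) into constant-behavior regions, cheapest region first."""
    rng = random.Random(seed)
    unexplored = [(lo, hi)]
    resolved = []  # entries are (cost, (a, b))
    while unexplored:
        p_lo, p_hi = unexplored.pop()
        theta = rng.uniform(p_lo, p_hi)          # sample a parameter value
        a, b = region_around(theta, p_lo, p_hi)  # region sharing theta's behavior
        resolved.append((expected_cost(theta), (a, b)))
        if p_lo < a:                             # keep the unexplored remainder
            unexplored.append((p_lo, a))
        if b < p_hi:
            unexplored.append((b, p_hi))
    return sorted(resolved)

best_cost, best_region = refine()[0]
print(best_cost, best_region)
```

In this toy the search ends after three regions because only two belief values are reachable. It enumerates beliefs exactly, which is a simplification made here for brevity.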

Abstract

Planning in real-world settings often entails addressing partial observability while aligning with users' requirements. We present a novel framework for expressing users' constraints and preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) policies in the setting of goal-oriented partially observable Markov decision processes (gPOMDPs). We present the first formal analysis of such constraints and prove that while the expected cost function of a parameterized BSQ policy with respect to its parameters is not convex, it is piecewise constant and yields an implicit discrete parameter search space that is finite for finite horizons. This theoretical result leads to novel algorithms that optimize gPOMDP agent behavior with guaranteed user alignment. Analysis proves that our algorithms converge to the optimal user-aligned behavior in the limit. Empirical results show that parameterized BSQ policies provide a computationally feasible approach for user-aligned planning in partially observable settings.
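A small worked example may help make the piecewise-constant claim concrete. The sketch below assumes a hypothetical one-step problem: a part is broken with prior probability 1/2, a sensor with 75% accuracy gives one reading, and the agent repairs iff its updated belief exceeds a threshold `theta`. The costs and the rule are assumptions for illustration, not taken from the paper.

```python
from fractions import Fraction

def posterior(prior, obs_broken, accuracy):
    """Bayes update of the belief that the part is broken; also returns P(observation)."""
    like_broken = accuracy if obs_broken else 1 - accuracy
    like_ok = 1 - accuracy if obs_broken else accuracy
    evidence = prior * like_broken + (1 - prior) * like_ok
    return prior * like_broken / evidence, evidence

def expected_cost(theta, prior=Fraction(1, 2), accuracy=Fraction(3, 4),
                  repair_cost=2, failure_cost=5):
    """Expected cost of the rule 'repair iff belief > theta' (hypothetical costs)."""
    total = Fraction(0)
    for obs_broken in (True, False):
        belief, p_obs = posterior(prior, obs_broken, accuracy)
        # Repairing has a fixed cost; skipping risks the failure cost.
        cost = repair_cost if belief > theta else belief * failure_cost
        total += p_obs * cost
    return total

# Sweep theta over a grid: only a handful of distinct cost values appear.
thetas = [Fraction(k, 20) for k in range(21)]
distinct = sorted({expected_cost(t) for t in thetas})
print(distinct)

# Midpoint convexity fails across the jump at theta = 3/4.
left, mid, right = (expected_cost(Fraction(n, 20)) for n in (14, 15, 16))
print(mid > (left + right) / 2)
```

Only two beliefs are reachable here (1/4 and 3/4), so the cost changes only when `theta` crosses one of them. This gives three constant pieces, and the jump between pieces is what breaks convexity in this toy.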

Paper Structure

This paper contains 29 sections, 17 theorems, 18 equations, 6 figures, 3 tables, and 1 algorithm.

Introduction
Related Work
Formal Framework
Goal-Oriented Partially Observable Markov Decision Process
Belief-State Queries and Policies
Formal Analysis
Strategy Trees
Non-convexity of the expected cost function
Braids
BSQ Policies are Piecewise Constant
Partition Refinement Search
Partition Selection Approaches
Empirical Results
Baselines
Analysis of Results
...and 14 more sections

Key Result

Lemma 1

Let $\Psi(b;\overline{\Theta})$ be an $n$-dimensional compound BSQ. There exists a set of intervals $I(\Psi) \subseteq \mathbb{R}^n$ such that $\Psi(b;\overline{\Theta})$ evaluates to true iff $\overline{\Theta}\in I(\Psi)$.
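To illustrate the lemma, the sketch below assumes each query compares a fixed probability (computed from the belief $b$) against one parameter with a strict `>` or `<`, and that the compound query is a conjunction. This query form and the helper names are assumptions for illustration. Disjunctions, which would give a union of such boxes, are not covered.

```python
import math
from itertools import product

def interval_for(prob, op):
    """Open interval of parameter values for which `prob op theta` holds."""
    if op == ">":   # prob > theta  <=>  theta in (-inf, prob)
        return (-math.inf, prob)
    if op == "<":   # prob < theta  <=>  theta in (prob, +inf)
        return (prob, math.inf)
    raise ValueError(f"unsupported operator: {op}")

def evaluate(queries, thetas):
    """Directly evaluate a conjunction of threshold queries at the given parameters."""
    return all(p > t if op == ">" else p < t
               for (p, op), t in zip(queries, thetas))

def in_box(queries, thetas):
    """Check membership of the parameter vector in the product of per-query intervals."""
    boxes = [interval_for(p, op) for p, op in queries]
    return all(lo < t < hi for (lo, hi), t in zip(boxes, thetas))

# Hypothetical belief-derived probabilities for a 2-dimensional compound query.
queries = [(0.75, ">"), (0.4, "<")]
grid = [0.0, 0.4, 0.5, 0.75, 1.0]
agree = all(evaluate(queries, th) == in_box(queries, th)
            for th in product(grid, repeat=2))
print(agree)
```

For a fixed belief, the query is true exactly on the box $(-\infty, 0.75) \times (0.4, \infty)$ in this example, which is the kind of interval set the lemma describes.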

Figures (6)

Figure 1: (a) Spaceship Repair running example. (b) parameterized BSQ policy for the user preference from the Introduction. (c) The expected cumulative cost function for (b) with a horizon of 12.
Figure 2: (a) Strategy tree created from the parameterized BSQ policy in Fig. 1 and the Spaceship Repair gPOMDP with a horizon of 2. (b) Complete partitions of parameter space with two of the braids highlighted. Error detection sensor accuracy for the robot and ship is 60% and 75%, respectively.
Figure 3: Empirical results tracking the performance of the hypothesized optimal partition. PRS is sampled at equally spaced points across its evaluation time, while Nelder-Mead and Particle Swarm are sampled once per iteration. The error shown is the standard deviation.
Figure 4: Results for PRS with the different partition selection approaches (see the Partition Selection Approaches section).
Figure 5: Performance of the hypothesized optimal partition while solving for the Lane Merger, Spaceship Repair, and Store Visit problems. Each line is the average over 10 independent runs with the standard deviation error shown.
...and 1 more figures

Theorems & Definitions (38)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
Definition 6
Definition 7
Lemma 1
Definition 8
Definition 9
...and 28 more
