Belief-State Query Policies for User-Aligned POMDPs
Daniel Bramblett, Siddharth Srivastava
TL;DR
The paper introduces belief-state query (BSQ) policies for expressing user-aligned preferences in goal-oriented POMDPs (gPOMDPs) and formally analyzes their properties. It proves that the expected cost $E_\pi(\overline{\vartheta};H)$ of a parameterized BSQ policy is piecewise constant and non-convex in its parameters, and that for a finite horizon the parameter space is partitioned into a finite set of braids corresponding to leaves of strategy trees. It proposes a novel Partition Refinement Search (PRS) algorithm that refines parameter partitions along braid boundaries and is probabilistically complete, converging to the optimal user-aligned policy in the limit. Empirical results on Lane Merger, Spaceship Repair, Graph Rock Sample, and Store Visit show PRS outperforming baselines and existing solvers at producing policies that align with user requirements while remaining computationally feasible. The work enables user-driven constraint specification in partially observable settings without reward shaping.
Abstract
Planning in real-world settings often entails addressing partial observability while aligning with users' requirements. We present a novel framework for expressing users' constraints and preferences about agent behavior in goal-oriented partially observable Markov decision processes (gPOMDPs) using parameterized belief-state query (BSQ) policies. We present the first formal analysis of such constraints and prove that while the expected cost function of a parameterized BSQ policy with respect to its parameters is not convex, it is piecewise constant and yields an implicit discrete parameter search space that is finite for finite horizons. This theoretical result leads to novel algorithms that optimize gPOMDP agent behavior with guaranteed user alignment. Analysis proves that our algorithms converge to the optimal user-aligned behavior in the limit. Empirical results show that parameterized BSQ policies provide a computationally feasible approach for user-aligned planning in partially observable settings.
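To make the piecewise-constant claim concrete, the following is a minimal sketch on a toy problem that is not taken from the paper. It assumes a single hidden binary state (good or bad), a noisy sensor, and a one-parameter BSQ-style rule: commit if Pr[good] >= theta, otherwise sense, and abandon once the horizon is exhausted. All names, costs, and the sensor accuracy are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical toy parameters (assumptions, not from the paper).
ACC = Fraction(4, 5)      # probability the sensor reports the true state
SENSE_COST = 1            # cost of one sensing action
BAD_COMMIT_COST = 10      # cost of committing when the state is bad
ABANDON_COST = 4          # cost of giving up when the horizon runs out


def update(belief, obs_good):
    """Bayes update of Pr[good] after one noisy observation."""
    like_good = ACC if obs_good else 1 - ACC
    like_bad = 1 - ACC if obs_good else ACC
    num = like_good * belief
    return num / (num + like_bad * (1 - belief))


def expected_cost(theta, belief=Fraction(1, 2), horizon=3):
    """Exact expected cost of the threshold rule with parameter theta."""
    if belief >= theta:                      # belief-state query succeeds
        return (1 - belief) * BAD_COMMIT_COST
    if horizon == 0:                         # no steps left: abandon
        return ABANDON_COST
    p_good_obs = ACC * belief + (1 - ACC) * (1 - belief)
    return (SENSE_COST
            + p_good_obs * expected_cost(theta, update(belief, True), horizon - 1)
            + (1 - p_good_obs) * expected_cost(theta, update(belief, False), horizon - 1))


if __name__ == "__main__":
    thetas = [Fraction(i, 100) for i in range(1, 101)]
    costs = [expected_cost(t) for t in thetas]
    # Only a handful of distinct values appear across 100 thresholds.
    print(sorted(set(float(c) for c in costs)))
```

Because only finitely many beliefs are reachable within a finite horizon, the rule's behavior changes only when theta crosses one of those beliefs, so the cost is constant on each interval between them. In this toy setting those intervals play the role that braids play in the paper's analysis; the sketch evaluates the cost on a grid and does not implement PRS.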
