Table of Contents
Fetching ...

Active Perception with Initial-State Uncertainty: A Policy Gradient Method

Chongyang Shi, Shuo Han, Michael Dorothy, Jie Fu

TL;DR

This paper addresses active perception in a stochastic system modeled as a hidden Markov model with controllable emissions, aiming to minimize the conditional entropy $H(S_0|Y)$ to maximize information about the initial state $S_0$ from partial observations $Y=(O,A)$. The authors develop a policy-gradient method that leverages an observable-operator formulation to efficiently compute the gradient $\nabla_\theta H(S_0|Y;\theta)$, and prove convergence under Lipschitz continuity and smoothness assumptions. They establish that the planning problem cannot be reduced to a $\rho$-POMDP with a belief-based reward and provide a principled gradient estimator, including a sample-based approximation for scalability. The method is validated on a stochastic grid world, where a min-entropy policy achieves faster and more accurate identification of the true initial state than random policies, demonstrating practical potential for intent recognition and system diagnosis.

Abstract

This paper studies the synthesis of an active perception policy that maximizes the information leakage of the initial state in a stochastic system modeled as a hidden Markov model (HMM). Specifically, the emission function of the HMM is controllable with a set of perception or sensor query actions. Given the goal is to infer the initial state from partial observations in the HMM, we use Shannon conditional entropy as the planning objective and develop a novel policy gradient method with convergence guarantees. By leveraging a variant of observable operators in HMMs, we prove several important properties of the gradient of the conditional entropy with respect to the policy parameters, which allow efficient computation of the policy gradient and stable and fast convergence. We demonstrate the effectiveness of our solution by applying it to an inference problem in a stochastic grid world environment.

Active Perception with Initial-State Uncertainty: A Policy Gradient Method

TL;DR

This paper addresses active perception in a stochastic system modeled as a hidden Markov model with controllable emissions, aiming to minimize the conditional entropy to maximize information about the initial state from partial observations . The authors develop a policy-gradient method that leverages an observable-operator formulation to efficiently compute the gradient , and prove convergence under Lipschitz continuity and smoothness assumptions. They establish that the planning problem cannot be reduced to a -POMDP with a belief-based reward and provide a principled gradient estimator, including a sample-based approximation for scalability. The method is validated on a stochastic grid world, where a min-entropy policy achieves faster and more accurate identification of the true initial state than random policies, demonstrating practical potential for intent recognition and system diagnosis.

Abstract

This paper studies the synthesis of an active perception policy that maximizes the information leakage of the initial state in a stochastic system modeled as a hidden Markov model (HMM). Specifically, the emission function of the HMM is controllable with a set of perception or sensor query actions. Given the goal is to infer the initial state from partial observations in the HMM, we use Shannon conditional entropy as the planning objective and develop a novel policy gradient method with convergence guarantees. By leveraging a variant of observable operators in HMMs, we prove several important properties of the gradient of the conditional entropy with respect to the policy parameters, which allow efficient computation of the policy gradient and stable and fast convergence. We demonstrate the effectiveness of our solution by applying it to an inference problem in a stochastic grid world environment.
Paper Structure (5 sections, 7 theorems, 32 equations, 2 figures)

This paper contains 5 sections, 7 theorems, 32 equations, 2 figures.

Key Result

Proposition 1

There is no belief-based reward $R:\Delta(S)\rightarrow \mathbb{R}$ such that $\sum_{i=0}^t R(b_i) = - H(S_0|y_{0:t})$.

Figures (2)

  • Figure 1: A stochastic grid world monitored by a set of sensors.
  • Figure 2: The convergence results of policy gradients algorithm and beliefs evolution for different true types of robots.

Theorems & Definitions (19)

  • Definition 1
  • Definition 2
  • Definition 3
  • Remark 1
  • Proposition 1
  • proof
  • Definition 4
  • Proposition 2
  • proof
  • Proposition 3
  • ...and 9 more