C-IDS: Solving Contextual POMDP via Information-Directed Objective

Chongyang Shi, Michael Dorothy, Jie Fu

TL;DR

This work addresses policy synthesis in CPOMDPs, where an unknown latent context determines the environment dynamics. It proposes C-IDS, an algorithm built on an information-directed objective that combines reward with mutual information about the context, optimized with a variational policy gradient. The authors interpret the objective as a Lagrangian relaxation of the linear information ratio and use this to establish a sublinear Bayesian regret bound. They validate the approach in a continuous Light-Dark setting, where faster context identification yields higher returns. Empirically, C-IDS outperforms standard POMDP solvers that treat the unknown context as a latent state variable, which illustrates the value of active information acquisition in partially observable environments with latent context.

Abstract

We study the policy synthesis problem in contextual partially observable Markov decision processes (CPOMDPs), where the environment is governed by an unknown latent context that induces distinct POMDP dynamics. Our goal is to design a policy that simultaneously maximizes cumulative return and actively reduces uncertainty about the underlying context. We introduce an information-directed objective that augments reward maximization with mutual information between the latent context and the agent's observations. We develop the C-IDS algorithm to synthesize policies that maximize the information-directed objective. We show that the objective can be interpreted as a Lagrangian relaxation of the linear information ratio and prove that the temperature parameter is an upper bound on the information ratio. Based on this characterization, we establish a sublinear Bayesian regret bound over K episodes. We evaluate our approach on a continuous Light-Dark environment and show that it consistently outperforms standard POMDP solvers that treat the unknown context as a latent state variable, achieving faster context identification and higher returns.
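The idea of augmenting reward with mutual information can be illustrated on a toy discrete problem. The sketch below is not the paper's algorithm: it assumes an additive form (expected return plus `tau` times the mutual information between context and observation), and the function names, the weight `tau`, and the two example actions are all hypothetical.

```python
import numpy as np

def mutual_information(prior, likelihood):
    """I(C; Y) in nats for a discrete context C and observation Y.

    prior:      shape (n_contexts,), P(C = c)
    likelihood: shape (n_contexts, n_obs), P(Y = y | C = c)
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior[:, None] * likelihood            # P(c, y)
    marginal_y = joint.sum(axis=0)                 # P(y)
    indep = prior[:, None] * marginal_y[None, :]   # P(c) P(y)
    mask = joint > 0                               # skip 0 * log 0 terms
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

def information_directed_value(expected_return, prior, likelihood, tau):
    """Assumed additive surrogate: return + tau * I(C; Y)."""
    return expected_return + tau * mutual_information(prior, likelihood)

# Hypothetical example: two equally likely contexts, two candidate actions.
prior = [0.5, 0.5]
uninformative = [[0.5, 0.5], [0.5, 0.5]]   # observation independent of context
informative = [[0.9, 0.1], [0.1, 0.9]]     # observation reveals the context

move_value = information_directed_value(1.0, prior, uninformative, tau=1.0)
sense_value = information_directed_value(0.8, prior, informative, tau=1.0)
```

In this toy setting the informative action has the lower immediate return but the higher information-directed value, which is the kind of trade-off the objective is meant to capture.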

Paper Structure (29 sections, 8 theorems, 94 equations, 5 figures, 1 table, 2 algorithms)

Key Result

Lemma 2.2

The entropy-regulated optimal distribution $Q$ that maximizes where $H(Q)$ is the Shannon entropy of $Q$, is given by: where $R_c(y) = \mathds{E}_i \left[\sum_{t=0}^T R_c(S_t, A_t)|y\right]$.
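The displayed formulas of the lemma did not survive in this listing. As a general illustration only, a standard result for entropy-regularized maximization over a finite set is that $\mathbb{E}_Q[R] + \tau H(Q)$ is maximized by the Gibbs distribution $Q^*(y) \propto \exp(R(y)/\tau)$. The sketch below checks this numerically; the temperature `tau`, the reward vector, and the function names are assumptions and may not match the paper's notation.

```python
import numpy as np

def gibbs_distribution(rewards, tau):
    """Q*(y) proportional to exp(R(y) / tau), computed stably."""
    logits = np.asarray(rewards, dtype=float) / tau
    logits -= logits.max()              # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def regularized_objective(q, rewards, tau):
    """E_Q[R] + tau * H(Q), with H the Shannon entropy in nats."""
    q = np.asarray(q, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    nz = q > 0                          # skip 0 * log 0 terms
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return float(q @ rewards + tau * entropy)

# Hypothetical rewards for three outcomes y.
rewards = np.array([1.0, 2.0, 0.5])
tau = 0.7
q_star = gibbs_distribution(rewards, tau)
best = regularized_objective(q_star, rewards, tau)

# No randomly drawn distribution should beat the Gibbs distribution.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
gaps = [best - regularized_objective(q, rewards, tau) for q in candidates]
```

Under these assumptions the optimal value equals $\tau \log \sum_y \exp(R(y)/\tau)$, which gives a quick consistency check on any implementation.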

Figures (5)

  • Figure 1: A robot moves along a line grid with seven cells. At flag cells, the robot can receive a reward. At the detector cell, the robot is detected and receives a penalty. Context 0: Cell 0 has a high-value target with reward, and cell $6$ is equipped with a detector. Cell 0 has a low-value target with reward 10. Context 1: Cell $6$ has a high-value target with reward, and cell $1$ is equipped with a detector. The robot can either move to an adjacent cell or use the "sense" action to detect the presence of a detector to its left or right.
  • Figure 2: The light-dark environment. In context $0$, the light region is $x > 0$ and the dark region is $x \le 0$. In context $1$, the light region is $x < 0$ and the dark region is $x \ge 0$. The agent's observation noise differs between the two regions. The observation models are Gaussian distributions, shown by the red and blue curves. In context $0$, the reward region is $x > 1$ and the penalty region is $x < 1$. In context $1$, the reward region is $x < -1$ and the penalty region is $x > 0$.
  • Figure 3: Convergence results for total return, entropy, and regret for different algorithms.
  • Figure 4: Error bars for different variance ratios.
  • Figure 5: Trajectories generated under C-IDS policy and comparisons with baseline methods.
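
The context-dependent observation model described in the Figure 2 caption can be sketched as follows. This is only an illustration: the caption states that the noise is Gaussian and differs between light and dark regions, but the standard deviations `sigma_light` and `sigma_dark`, the assumption that the observation is the position plus noise, and all function names are hypothetical.

```python
import numpy as np

def is_light(x, context):
    """Light region per the caption: x > 0 in context 0, x < 0 in context 1."""
    if context == 0:
        return x > 0
    if context == 1:
        return x < 0
    raise ValueError("context must be 0 or 1")

def observation_std(x, context, sigma_light=0.1, sigma_dark=1.0):
    """Assumed noise levels: small in the light region, large in the dark."""
    return sigma_light if is_light(x, context) else sigma_dark

def sample_observation(x, context, rng, sigma_light=0.1, sigma_dark=1.0):
    """Assumed model: observed position = true position + Gaussian noise."""
    std = observation_std(x, context, sigma_light, sigma_dark)
    return x + std * rng.standard_normal()

rng = np.random.default_rng(0)
obs_ctx0 = [sample_observation(1.0, 0, rng) for _ in range(1000)]  # light
obs_ctx1 = [sample_observation(1.0, 1, rng) for _ in range(1000)]  # dark
```

Because the same position is observed with different noise under the two contexts, the spread of the observations carries information about the context, which is what makes active context identification possible in this environment.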

Theorems & Definitions (17)

  • Lemma 2.2
  • proof
  • Lemma 2.3
  • proof
  • Remark 3.3
  • Lemma 3.4
  • Lemma 3.5: Lagrangian surrogate for ratio minimization
  • proof: Proof of Lemma \ref{lem:upper-bound}
  • Lemma 3.6
  • proof
  • ...and 7 more