C-IDS: Solving Contextual POMDP via Information-Directed Objective

Chongyang Shi; Michael Dorothy; Jie Fu

TL;DR

This work addresses policy synthesis in CPOMDPs where an unknown latent context shapes environment dynamics. It proposes C-IDS, an information-directed objective that blends reward with mutual information about the context, and a variational policy gradient to optimize it. The authors establish a sublinear Bayesian regret bound by interpreting the objective as a Lagrangian relaxation of the linear information ratio, and validate the approach in a continuous Light–Dark setting where faster context identification yields higher returns. Empirically, C-IDS outperforms standard POMDP solvers that ignore context uncertainty, demonstrating the value of active information acquisition in context-rich, partially observable environments.

Abstract

We study the policy synthesis problem in contextual partially observable Markov decision processes (CPOMDPs), where the environment is governed by an unknown latent context that induces distinct POMDP dynamics. Our goal is to design a policy that simultaneously maximizes cumulative return and actively reduces uncertainty about the underlying context. We introduce an information-directed objective that augments reward maximization with mutual information between the latent context and the agent's observations. We develop the C-IDS algorithm to synthesize policies that maximize the information-directed objective. We show that the objective can be interpreted as a Lagrangian relaxation of the linear information ratio and prove that the temperature parameter is an upper bound on the information ratio. Based on this characterization, we establish a sublinear Bayesian regret bound over K episodes. We evaluate our approach on a continuous Light-Dark environment and show that it consistently outperforms standard POMDP solvers that treat the unknown context as a latent state variable, achieving faster context identification and higher returns.
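The trade-off described in the abstract can be illustrated with a small discrete sketch. It is not the paper's algorithm: it only scores two hypothetical one-step options by expected return plus a weighted mutual information between context and observation. The additive form, the name `info_weight`, and all numbers are assumptions made for illustration.

```python
import math

def mutual_information(prior, likelihood):
    """Return I(C; Y) in nats, given prior[c] = p(c) and likelihood[c][y] = p(y | c)."""
    n_obs = len(likelihood[0])
    # Marginal observation distribution p(y) = sum_c p(c) p(y | c).
    marginal = [
        sum(prior[c] * likelihood[c][y] for c in range(len(prior)))
        for y in range(n_obs)
    ]
    mi = 0.0
    for c, p_c in enumerate(prior):
        for y in range(n_obs):
            joint = p_c * likelihood[c][y]
            if joint > 0.0:  # zero-probability terms contribute nothing
                mi += joint * math.log(joint / (p_c * marginal[y]))
    return mi

def information_directed_score(expected_return, prior, likelihood, info_weight):
    """Expected return plus weighted information gain (assumed additive form)."""
    return expected_return + info_weight * mutual_information(prior, likelihood)

prior = [0.5, 0.5]                      # uniform belief over two contexts
revealing = [[1.0, 0.0], [0.0, 1.0]]    # observation identifies the context
uninformative = [[0.5, 0.5], [0.5, 0.5]]

exploit = information_directed_score(1.0, prior, uninformative, info_weight=0.5)
probe = information_directed_score(0.8, prior, revealing, info_weight=0.5)
print(exploit, probe)
```

With these made-up numbers the probing option scores higher despite its lower immediate return, because its observation resolves the context completely (information gain of log 2 nats).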

Paper Structure (29 sections, 8 theorems, 94 equations, 5 figures, 1 table, 2 algorithms)

Introduction
Contribution.
Related Work.
Contextual Partially Observable Markov Decision Process
Planning to Maximize Information-Directed Objective
$C$-IDS: Maximizing a weighted combination of reward and information gain
Variational Policy Gradient for Optimizing Information-Directed Objective
Regret Analysis of $C$-IDS Algorithm
Bound for Expected Episodic Regret $\Delta_k$
Bounding the Term $I_k^{(1)}$
Bounding the Martingale Difference $I_k^{(2)}$
Final Regret Bound
Experiments
Environment.
Implementation.
...and 14 more sections

Key Result

Lemma 2.2

The entropy-regularized optimal distribution $Q$ that maximizes [equation omitted in source], where $H(Q)$ is the Shannon entropy of $Q$, is given by [equation omitted in source], where $R_c(y) = \mathbb{E}_i \left[\sum_{t=0}^T R_c(S_t, A_t) \mid y\right]$.

Figures (5)

Figure 1: A robot moves along a line grid with seven cells. At flag cells, the robot can receive a reward. At the detector cell, the robot is detected and receives a penalty. Context 0: Cell 0 has a high-value target with reward, and cell $6$ is equipped with a detector. Cell 0 has a low-value target with reward 10. Context 1: Cell $6$ has a high-value target with reward, and cell $1$ is equipped with a detector. The robot can either move to an adjacent cell or use the "sense" action to detect the presence of a detector to its left or right.
Figure 2: The Light-Dark environment. In context $0$, the light region is $x > 0$ and the dark region is $x \le 0$. In context $1$, the light region is $x < 0$ and the dark region is $x \ge 0$. The agent's observation noise differs between the two regions; the observation models are Gaussian distributions, shown by the red and blue curves. In context $0$, the reward region is $x > 1$ and the penalty region is $x < 1$. In context $1$, the reward region is $x < -1$ and the penalty region is $x > 0$.
Figure 3: Convergence results for total return, entropy, and regret for different algorithms.
Figure 4: Error bars for different variance ratios.
Figure 5: Trajectories generated under C-IDS policy and comparisons with baseline methods.
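The Light-Dark setup in Figure 2 suggests why visiting a region with context-dependent noise helps identify the context. The sketch below is a simplified illustration, not the paper's implementation: it assumes the agent's position `x` is known, that the observation is the position plus Gaussian noise, and that the noise standard deviations `SIGMA_LIGHT` and `SIGMA_DARK` take the hypothetical values shown. Only the light and dark region boundaries come from the caption.

```python
import math

SIGMA_LIGHT = 0.1  # hypothetical noise std in the light region
SIGMA_DARK = 1.0   # hypothetical noise std in the dark region

def in_light(context, x):
    """Light region per the Figure 2 caption: x > 0 in context 0, x < 0 in context 1."""
    return x > 0 if context == 0 else x < 0

def obs_std(context, x):
    """Observation noise std at position x under the given context."""
    return SIGMA_LIGHT if in_light(context, x) else SIGMA_DARK

def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def update_context_belief(belief, x, y):
    """Bayes update of [p(c=0), p(c=1)] after observing y at known position x."""
    weights = [belief[c] * gaussian_pdf(y, x, obs_std(c, x)) for c in (0, 1)]
    total = sum(weights)
    return [w / total for w in weights]

# At x = 2 the two contexts predict different noise levels, so a precise
# observation shifts the belief toward context 0.
print(update_context_belief([0.5, 0.5], x=2.0, y=2.05))
# At x = 0 both contexts place the agent in the dark region, so the
# observation carries no information about the context.
print(update_context_belief([0.3, 0.7], x=0.0, y=0.4))
```

Under these assumptions, an observation taken at the boundary leaves the belief unchanged, while one taken away from it discriminates between contexts, which is the kind of information-seeking behavior the objective is meant to reward.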

Theorems & Definitions (17)

Lemma 2.2
proof
Lemma 2.3
proof
Remark 3.3
Lemma 3.4
Lemma 3.5: Lagrangian surrogate for ratio minimization
proof: Proof of Lemma \ref{lem:upper-bound}
Lemma 3.6
proof
...and 7 more
