OCMDP: Observation-Constrained Markov Decision Process

Taiyi Wang, Jianheng Liu, Bryan Lee, Zhihao Wu, Yu Wu

TL;DR

OCMDP addresses the challenge of balancing information-gathering costs with decision quality by introducing an Observation-Constrained Markov Decision Process and a model-free, iteratively optimized RL framework that decomposes sensing and control into separate policies. By defining a trajectory-based Q-function and employing alternating policy gradient updates, the approach converges to a locally optimal joint policy while reducing observation expenditures. Empirical results on a synthetic Diagnostic Chain and the HeartPole healthcare simulator show substantial improvements in both control performance and observation efficiency over strong baselines, with faster learning convergence. The work is significant for resource-constrained decision making in healthcare and other domains where costly observations must be judiciously managed to maintain task performance.

Abstract

In many practical applications, decision-making processes must balance the costs of acquiring information with the benefits it provides. Traditional control systems often assume full observability, an unrealistic assumption when observations are expensive. We tackle the challenge of simultaneously learning observation and control strategies in such cost-sensitive environments by introducing the Observation-Constrained Markov Decision Process (OCMDP), where the policy influences the observability of the true state. To manage the complexity arising from the combined observation and control actions, we develop an iterative, model-free deep reinforcement learning algorithm that separates the sensing and control components of the policy. This decomposition enables efficient learning in the expanded action space by focusing on when and what to observe, as well as determining optimal control actions, without requiring knowledge of the environment's dynamics. We validate our approach on a simulated diagnostic task and a realistic healthcare environment using HeartPole. In both scenarios, the experimental results show that our model substantially reduces average observation costs and outperforms baseline methods in efficiency by a notable margin.
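To make the setting concrete, the following is a minimal sketch of an environment in which the agent chooses both what to observe and how to act, and pays for each observed feature. It is an illustration under stated assumptions, not the paper's environment: the class name `ToyOCMDP`, the random-walk dynamics, the two state features, and the cost values are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ToyOCMDP:
    """Hypothetical toy environment: a 1-D random walk with two state features.

    Feature 0 is the position and feature 1 is the most recent drift.
    Each feature has its own observation cost (made-up numbers).
    """
    obs_costs: Tuple[float, ...] = (0.1, 0.05)
    target: int = 0
    seed: int = 0

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)
        self.state: List[int] = [3, 0]

    def step(
        self, observe: Sequence[bool], control: int
    ) -> Tuple[List[Optional[int]], float]:
        # Control action and a random drift move the hidden state.
        drift = self.rng.choice([-1, 0, 1])
        self.state = [self.state[0] + control + drift, drift]
        # Task reward: stay close to the target position.
        task_reward = -abs(self.state[0] - self.target)
        # Observation cost: pay only for the features that were requested.
        cost = sum(c for c, m in zip(self.obs_costs, observe) if m)
        # Unrequested features are hidden from the agent.
        observation = [s if m else None for s, m in zip(self.state, observe)]
        return observation, task_reward - cost
```

In this sketch the sensing decision (`observe`) and the control decision (`control`) are separate arguments, mirroring the decomposition into a sensing policy and a control policy described above; a combined policy would have to choose from their product space.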

Paper Structure

This paper contains 21 sections, 4 theorems, 21 equations, 6 figures, 1 algorithm.

Key Result

Lemma 1

Consider the Bellman optimality operator $\mathcal{T}$ acting on action-value functions $Q(h_t, a_t)$. Under the supremum norm $\|Q\|_\infty = \sup_{h_t, a_t} |Q(h_t, a_t)|$, $\mathcal{T}$ is a contraction mapping with contraction factor $\gamma$, so repeated application of $\mathcal{T}$ converges to the unique optimal action-value function $Q^*$.
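As a numerical sanity check of the contraction property, here is a minimal sketch on a small random tabular MDP. It assumes the textbook form of the operator, $(\mathcal{T}Q)(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$, with a finite state index standing in for $h_t$. The MDP, its sizes, and all function names are hypothetical and not taken from the paper.

```python
import random


def bellman_optimality(Q, P, R, gamma):
    """Apply (TQ)(s, a) = R[s][a] + gamma * sum_s' P[s][a][s'] * max_a' Q[s'][a']."""
    n_states, n_actions = len(Q), len(Q[0])
    best_next = [max(row) for row in Q]  # max over a' for every next state
    return [
        [
            R[s][a] + gamma * sum(P[s][a][s2] * best_next[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def sup_distance(Q1, Q2):
    """Supremum-norm distance between two tabular Q-functions."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def random_mdp(n_states, n_actions, rng):
    """Random transition probabilities (rows sum to 1) and rewards in [-1, 1]."""
    P, R = [], []
    for _ in range(n_states):
        P_row, R_row = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            P_row.append([w / total for w in weights])
            R_row.append(rng.uniform(-1.0, 1.0))
        P.append(P_row)
        R.append(R_row)
    return P, R


rng = random.Random(0)
gamma = 0.9
P, R = random_mdp(4, 3, rng)
Q1 = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(4)]
Q2 = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(4)]

# One application of T shrinks the sup-norm distance by at least a factor gamma.
before = sup_distance(Q1, Q2)
after = sup_distance(
    bellman_optimality(Q1, P, R, gamma), bellman_optimality(Q2, P, R, gamma)
)

# Repeated application approaches a fixed point: the Bellman residual vanishes.
Q_star = Q1
for _ in range(500):
    Q_star = bellman_optimality(Q_star, P, R, gamma)
residual = sup_distance(Q_star, bellman_optimality(Q_star, P, R, gamma))
```

Here `after` should be no larger than `gamma * before`, and `residual` should be near zero. Starting the iteration from `Q2` instead of `Q1` should lead to the same fixed point, which is the uniqueness part of the statement.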

Figures (6)

  • Figure 1: Active observations within full state space
  • Figure 2: Diagram of Observation-Constrained MDP (OCMDP) Solver
  • Figure 3: Diagnostic Task
  • Figure 4: An Illustration of the observation policy evolution (Diagnostic Chain)
  • Figure 5: Performance comparison between our proposed iterative policy optimization method and baseline approaches on the OCMDPs over Diagnostic Chain Task.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Lemma 1: Value Function Contraction
  • proof
  • Remark 1
  • Lemma 2: Policy Enhancement
  • proof
  • Proposition 1: Convergence to a Locally Optimal Policy
  • Proposition 2: Conditions for Global Optimality
  • proof