OCMDP: Observation-Constrained Markov Decision Process
Taiyi Wang, Jianheng Liu, Bryan Lee, Zhihao Wu, Yu Wu
TL;DR
OCMDP addresses the challenge of balancing information-gathering costs against decision quality. It introduces the Observation-Constrained Markov Decision Process together with a model-free, iteratively optimized RL framework that decomposes sensing and control into separate policies. By defining a trajectory-based Q-function and alternating policy gradient updates between the two policies, the approach converges to a locally optimal joint policy while reducing observation expenditure. Empirical results on a synthetic Diagnostic Chain task and the HeartPole healthcare simulator show substantial improvements over strong baselines in both control performance and observation efficiency, along with faster learning convergence. The work is relevant to resource-constrained decision making in healthcare and other domains where costly observations must be managed judiciously to maintain task performance.
Abstract
In many practical applications, decision-making processes must balance the cost of acquiring information against the benefit it provides. Traditional control systems often assume full observability, an unrealistic assumption when observations are expensive. We tackle the challenge of simultaneously learning observation and control strategies in such cost-sensitive environments by introducing the Observation-Constrained Markov Decision Process (OCMDP), in which the policy influences the observability of the true state. To manage the complexity arising from the combined observation and control actions, we develop an iterative, model-free deep reinforcement learning algorithm that separates the sensing and control components of the policy. This decomposition enables efficient learning in the expanded action space: the sensing component decides when and what to observe, the control component selects control actions, and neither requires knowledge of the environment's dynamics. We validate our approach on a simulated diagnostic task and on HeartPole, a realistic healthcare environment. In both scenarios, our model substantially reduces average observation costs and significantly outperforms baseline methods in efficiency.
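To make the sensing/control split concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical deterministic chain environment (`ToyOCMDP`), a fixed per-observation cost (`obs_cost`), and two hand-written rather than learned policies (`sensing_policy`, `control_policy`). All names and numeric values are illustrative assumptions; the example only shows how an observation decision and a control decision can be taken separately at each step while the reward is charged for observing.

```python
from dataclasses import dataclass


@dataclass
class ToyOCMDP:
    """Hypothetical 1-D chain (an assumption for illustration only)."""
    n_states: int = 5
    goal: int = 4
    obs_cost: float = 0.1  # assumed price of a single observation
    state: int = 0

    def step(self, observe: bool, control: int):
        # Apply the control action (-1 or +1), clipped to the chain.
        self.state = max(0, min(self.n_states - 1, self.state + control))
        task_reward = 1.0 if self.state == self.goal else 0.0
        # The observation cost is subtracted only when the agent observes.
        reward = task_reward - (self.obs_cost if observe else 0.0)
        # The true state is revealed only if the agent paid for it.
        observation = self.state if observe else None
        return observation, reward, self.state == self.goal


def sensing_policy(steps_since_obs: int, period: int) -> bool:
    """Hand-written stand-in for a learned sensing policy."""
    return steps_since_obs >= period


def control_policy(belief: int, goal: int) -> int:
    """Hand-written stand-in for a learned control policy."""
    return 1 if belief < goal else -1


def run_episode(env: ToyOCMDP, period: int, max_steps: int = 20):
    belief, since_obs, total, n_obs = env.state, 0, 0.0, 0
    for _ in range(max_steps):
        observe = sensing_policy(since_obs, period)
        control = control_policy(belief, env.goal)
        obs, reward, done = env.step(observe, control)
        total += reward
        if obs is not None:
            # Fresh observation: reset the belief to the revealed state.
            belief, since_obs, n_obs = obs, 0, n_obs + 1
        else:
            # No observation: propagate the belief with the chosen control.
            belief = max(0, min(env.n_states - 1, belief + control))
            since_obs += 1
        if done:
            break
    return total, n_obs


if __name__ == "__main__":
    for period in (0, 2):
        total, n_obs = run_episode(ToyOCMDP(), period)
        print(f"period={period} return={total:.2f} observations={n_obs}")
```

In this toy setting, observing less often yields a higher return because the dynamics are deterministic and the belief never drifts. In a stochastic environment the trade-off would be non-trivial, which is the regime the learned sensing policy described above is meant to handle.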
