Causal prompting model-based offline reinforcement learning
Xuehui Yu, Yi Guan, Rujia Shen, Xin Li, Chen Tang, Jingchi Jiang
TL;DR
CPRL tackles offline reinforcement learning for real-world medical decision support from suboptimal data. It combines Hip-BCPD dynamics models, guided by invariant causal prompts, with a hierarchical CCM policy that reuses learned skills. Hip-BCPD captures the causal structure shared across environments while encoding environment-specific hidden parameters, which enables robust generalization to new users. A model-ensemble strategy mitigates overfitting to noisy offline data, and a single policy built from reusable sub-skills improves stability under distribution shift. Experiments on simulated glucose–insulin control and real-world Dnurse data show that CPRL outperforms baselines, and ablations validate the contributions of causal prompting and skill reuse, suggesting practical value for data-limited clinical decision-support systems.
Abstract
Model-based offline Reinforcement Learning (RL) allows agents to fully utilise pre-collected datasets without requiring additional or unethical exploration. However, applying model-based offline RL to online systems is challenging, primarily because the datasets these systems generate are highly suboptimal (noise-filled) and diverse. To tackle these issues, we introduce the Causal Prompting Reinforcement Learning (CPRL) framework, designed for highly suboptimal and resource-constrained online scenarios. In its first phase, CPRL introduces the Hidden-Parameter Block Causal Prompting Dynamic (Hip-BCPD) to model environmental dynamics. This approach utilises invariant causal prompts and aligns hidden parameters to generalise to new and diverse online users. In the second phase, a single policy is trained to address multiple tasks by combining reusable skills, which avoids training from scratch. Experiments on datasets with varying levels of noise, including simulation-based and real-world offline datasets from the Dnurse APP, demonstrate that our proposed method makes robust decisions in out-of-distribution and noisy environments and outperforms contemporary algorithms. Additionally, we separately verify the contributions of Hip-BCPD and the skill-reuse strategy to the robustness of performance. We further analyse the visualised structure of Hip-BCPD and the interpretability of the sub-skills. We have released our source code and the first-ever real-world medical dataset for precise medical decision-making tasks.
