Causal prompting model-based offline reinforcement learning

Xuehui Yu; Yi Guan; Rujia Shen; Xin Li; Chen Tang; Jingchi Jiang

TL;DR

CPRL tackles offline reinforcement learning for real-world medical decision support on suboptimal data. It combines Hip-BCPD dynamic models, guided by invariant causal prompts, with a hierarchical CCM policy that reuses learned skills. The Hip-BCPD framework captures causal structure shared across environments while encoding environment-specific hidden parameters, which enables robust generalisation to new users. A model-ensemble strategy mitigates overfitting to noisy offline data, and a single policy built from reusable sub-skills improves stability under distribution shift. Experiments on simulated glucose–insulin control and real-world Dnurse data show that CPRL outperforms baselines, and ablations validate the contributions of causal prompting and skill reuse, suggesting practical impact for data-limited clinical decision-support systems.
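To make the model-ensemble idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-dimensional state, linear dynamics, and bootstrap resampling; the function names `fit_linear`, `fit_ensemble`, and `predict` are hypothetical. It shows how averaging several models fitted on resampled noisy data yields a prediction plus a disagreement estimate.

```python
import random
import statistics


def fit_linear(pairs):
    """Least-squares fit of next_state = a * state + b on (state, next_state) pairs."""
    xs = [s for s, _ in pairs]
    ys = [n for _, n in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:  # degenerate resample: fall back to a constant predictor
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def fit_ensemble(pairs, n_models=5, seed=0):
    """Fit each member on a bootstrap resample (an assumption, not taken from the paper)."""
    rng = random.Random(seed)
    return [fit_linear([rng.choice(pairs) for _ in pairs]) for _ in range(n_models)]


def predict(ensemble, state):
    """Return the mean prediction and the spread (population std) across members."""
    preds = [a * state + b for a, b in ensemble]
    return statistics.fmean(preds), statistics.pstdev(preds)
```

The spread returned by `predict` can serve as a rough signal of where the offline data gives little support, which is one common reason to prefer an ensemble over a single dynamics model.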

Abstract

Model-based offline Reinforcement Learning (RL) allows agents to fully utilise pre-collected datasets without requiring additional or unethical explorations. However, applying model-based offline RL to online systems presents challenges, primarily due to the highly suboptimal (noise-filled) and diverse nature of datasets generated by online systems. To tackle these issues, we introduce the Causal Prompting Reinforcement Learning (CPRL) framework, designed for highly suboptimal and resource-constrained online scenarios. The initial phase of CPRL involves the introduction of the Hidden-Parameter Block Causal Prompting Dynamic (Hip-BCPD) to model environmental dynamics. This approach utilises invariant causal prompts and aligns hidden parameters to generalise to new and diverse online users. In the subsequent phase, a single policy is trained to address multiple tasks through the amalgamation of reusable skills, circumventing the need for training from scratch. Experiments conducted across datasets with varying levels of noise, including simulation-based and real-world offline datasets from the Dnurse APP, demonstrate that our proposed method can make robust decisions in out-of-distribution and noisy environments, outperforming contemporary algorithms. Additionally, we separately verify the contributions of Hip-BCPDs and the skill-reuse strategy to the robustness of performance. We further analyse the visualised structure of Hip-BCPD and the interpretability of sub-skills. We have released our source code and the first-ever real-world medical dataset for precise medical decision-making tasks.
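As a rough illustration of how an invariant causal structure and a user-specific hidden parameter might combine in a dynamics model, consider the sketch below. It is an assumption-laden toy, not the Hip-BCPD model itself: the three-variable state, the `MASK` and `WEIGHTS` values, the scalar `theta`, and the function `step` are all hypothetical.

```python
# MASK[i][j] = 1 means variable j is a causal parent of variable i.
# The mask is shared by every user (the "invariant" part of this toy).
MASK = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]

# Hypothetical shared weights; entries outside the mask are never used.
WEIGHTS = [
    [0.9, 0.5, 0.5],
    [0.2, 0.8, 0.5],
    [0.5, 0.1, 0.7],
]


def step(state, theta):
    """One transition: a causally masked linear map scaled by a per-user scalar theta."""
    return [
        theta * sum(m * w * s for m, w, s in zip(mask_row, weight_row, state))
        for mask_row, weight_row in zip(MASK, WEIGHTS)
    ]
```

In this toy, changing a variable that is not a causal parent of another leaves the latter's next value untouched, while `theta` changes the dynamics per user without altering the graph.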

Paper Structure (17 sections, 8 equations, 6 figures, 4 tables, 2 algorithms)

Introduction
Background
Causal Prompting Reinforcement Learning
Dynamic Model: Hidden-Parameter Block Causal Prompting Dynamic
Causal Prompt
Hip-BCPD
Policy: Skill-Reuse Strategy
Preventing Overfitting: Model Ensemble
Experiments
Design of Causal Prompt
Experimental Environment Setups
Baselines
Comparative Evaluation of CPRL in Simulation and Real-World Experiments
Visualisation of Hip-BCPD and Interpretability of Learned Skills
Robustness for different noisy and resource-limited versions of datasets
...and 2 more sections

Figures (6)

Figure 1: Illustration of the Causal Prompting Reinforcement Learning (CPRL) framework. CPRL learns on suboptimal offline datasets using the causal prompt $\mathcal{K}$ as a guiding mechanism. The causal prompt $\mathcal{K}$ leverages causal knowledge (top left) as a template, amalgamating the original input $o$ with reconstructed observations $o^{prompt}$. At the bottom left of the figure, the glucose-insulin system represents the pre-trained model, while the graphical structure visualises causal knowledge. The masked model $P_\Theta$ (green box) generates task-specific hidden parameters $\theta \in \Theta$ (i.e., $[MASK]$ in the causal prompt $\mathcal{K}$). The CPRL framework predominantly consists of two processes: 1) learning dynamic models (grey box); and 2) learning policies (pink box). Agents acquire dynamic models from offline datasets with the guidance of the causal prompt $\mathcal{K}$ and hidden parameters $\theta$. The constructed dynamic model is subsequently utilised for downstream policy learning.
Figure 2: Visualisations of the Hip-BCPD and HiP-BMDP settings. In the Hip-BCPD setting, causal prompts $\mathcal{K}_1$ and $\mathcal{K}_2$ act as edges connecting the variables of state $s=[s_1,s_2,s_3]$ into a causal graph. In the HiP-BMDP setting, by contrast, the state $s$ is latent and has no graph structure. In both settings, the hidden parameters $\theta$ determine the transition distributions $T_{\theta}$, and the emission function $q$ varies among environments. Figure 2(b) is cited from zhang2020learning.
Figure 3: Problem Statement. (a) Source of the offline datasets. (b) Real-world offline datasets are highly suboptimal, encompassing missing data (e.g., omitted uploads), value errors (e.g., incorrect input values, miscalculated carbohydrate estimates), and misplaced data. Misplaced data pertains to data that is outside the anticipated time horizon. (c) Our CPRL utilises observations as input to furnish support for medical decision-making, including suggestions for insulin dosages and meal size recommendations.
Figure 4: (a) The visualisation of learned Hip-BCPD and learned skills. (b) Modularisation of a glucose-insulin control system (cited from yu2022causal). The glucose-insulin control system can be segmented into the insulin subsystem, glucose subsystem, and other unit process models, each necessitating mutual information and influence.
Figure 5: Box and Whisker Plots for different noise levels of the Dnurse offline datasets.
...and 1 more figures
