Harnessing Causality in Reinforcement Learning With Bagged Decision Times
Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy
TL;DR
This work addresses online RL for problems with bagged decision times, where the actions taken within a bag (for example, a day) jointly yield a single end-of-bag reward and the within-bag dynamics are non-Markovian. Using an expert-provided causal DAG, the authors construct a dynamical Bayesian sufficient statistic (D-BaSS) of the observed history that yields Markovian state transitions both within and across bags, which enables a $K$-periodic MDP formulation with time-varying discounts. They prove the existence of optimal state constructions and identify the minimal D-BaSS as the preferred state for policy learning. To learn online with a linear Q-function, they generalize RLSVI to the periodic setting as Bagged RLSVI (BRLSVI). The framework is evaluated on HeartSteps-based testbeds with varying treatment effects and DAG misspecifications, where it shows favorable performance and robustness; this matters for mHealth interventions in which the cumulative actions of a day shape the outcome. Overall, the paper combines causal state construction, periodic dynamic programming, and online Bayesian RL for non-stationary, multi-time-scale settings.
Abstract
We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collectively affect a user's daily commitment to being active. Our goal is to develop an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markov state transitions within and across bags. We then formulate this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman equations for stationary MDPs is generalized to handle periodic MDPs. We show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Finally, we evaluate the proposed method on testbed variants built from real data in a mobile health clinical trial.
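To make the periodic-MDP formulation concrete, the following is a minimal sketch of tabular value iteration for a $K$-periodic MDP in which the reward arrives only at the last decision time of a bag and the discount is applied only at the bag boundary. It is not the paper's online algorithm: it assumes known transitions and a small discrete state space, and every name in it (`periodic_value_iteration`, `bag_reward`, the toy "number of prompts sent so far" state) is hypothetical and chosen purely for illustration.

```python
import numpy as np

def periodic_value_iteration(P, R, discounts, tol=1e-10, max_sweeps=10_000):
    """Value iteration for a K-periodic MDP (illustrative sketch only).

    P[k][s, a, s2]: transition probability at within-bag step k.
    R[k][s, a]:     expected reward at step k (zero except at the last step).
    discounts[k]:   discount applied after step k (1 within a bag, gamma at its end).
    """
    K = len(P)
    n_states = P[0].shape[0]
    V = np.zeros((K, n_states))  # one value function per within-bag step
    for _ in range(max_sweeps):
        V_old = V.copy()
        for k in reversed(range(K)):
            V_next = V[(k + 1) % K]  # step K-1 wraps around to step 0 of the next bag
            Q = R[k] + discounts[k] * (P[k] @ V_next)  # shape (n_states, n_actions)
            V[k] = Q.max(axis=1)
        if np.max(np.abs(V - V_old)) < tol:
            break
    # Greedy policy with respect to the converged values, one decision rule per step.
    policy = np.stack([
        (R[k] + discounts[k] * (P[k] @ V[(k + 1) % K])).argmax(axis=1)
        for k in range(K)
    ])
    return V, policy

# Hypothetical toy problem: K = 3 decision times per bag, binary action (send a prompt or not).
# The state counts the prompts sent so far in the bag, so it carries everything that
# the end-of-bag reward depends on.
K, gamma = 3, 0.9
n_states, n_actions = K + 1, 2
bag_reward = np.array([0.0, 1.0, 0.6, 0.1])  # assumed reward as a function of total prompts

P = [np.zeros((n_states, n_actions, n_states)) for _ in range(K)]
R = [np.zeros((n_states, n_actions)) for _ in range(K)]
for k in range(K):
    for s in range(n_states):
        for a in range(n_actions):
            count = min(s + a, K)
            if k < K - 1:
                P[k][s, a, count] = 1.0        # within the bag: accumulate the count
            else:
                P[k][s, a, 0] = 1.0            # end of bag: the count resets
                R[k][s, a] = bag_reward[count]  # the single bag-level reward

discounts = [1.0] * (K - 1) + [gamma]  # discount only across bags
V, policy = periodic_value_iteration(P, R, discounts)
```

In this toy problem, sending exactly one prompt per bag is optimal, so `V[0, 0]` should converge to $1/(1-\gamma) = 10$. The sketch keeps a separate value function for each within-bag step, which is what allows the dynamics and rewards to differ across steps while remaining periodic across bags.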
