Harnessing Causality in Reinforcement Learning With Bagged Decision Times

Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy

TL;DR

This work addresses online RL for problems with bagged decision times, where the actions within a bag (for example, a day) jointly yield a single end-of-bag reward and the within-bag dynamics are non-Markovian. Using an expert-provided causal DAG, the authors construct states as a dynamical Bayesian sufficient statistic (D-BaSS) of the observed history, which yields Markovian state transitions within and across bags and enables a $K$-periodic MDP formulation with time-varying discounts. They prove the existence of optimal state constructions, identify the minimal D-BaSS as the preferred state for policy learning, and adapt RLSVI into a bagged variant, BRLSVI (Bagged RLSVI), that learns online with a linear Q-function. The method is evaluated on HeartSteps-based testbeds with varying treatment effects and DAG misspecifications, where it shows favorable performance and robustness. The setting is relevant to mHealth interventions in which the actions taken over a day cumulatively shape the outcome.
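To make the "online Bayesian RL with a linear Q-function" idea concrete, the following is a minimal sketch of the generic building block behind RLSVI-style methods: fit a Bayesian linear regression to Q-value targets, draw one weight vector from the posterior, and act greedily under that draw. It is not the paper's BRLSVI. The function names, the Gaussian prior and noise variances, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sample_q_weights(features, targets, noise_var=1.0, prior_var=1.0, rng=None):
    """Draw linear Q-function weights from a Gaussian posterior.

    Assumes (hypothetically) a N(0, prior_var * I) prior on the weights and
    Gaussian target noise with variance noise_var.
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(dim) / prior_var
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # symmetrize against round-off
    mean = cov @ features.T @ targets / noise_var
    return rng.multivariate_normal(mean, cov), mean

def greedy_action(weights, action_features):
    """Pick the action whose feature row has the largest sampled Q-value."""
    return int(np.argmax(action_features @ weights))

# Synthetic illustration (made-up data, not from the paper).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
y = X @ true_w + 0.1 * rng.normal(size=2000)

sampled_w, post_mean = sample_q_weights(X, y, noise_var=0.01, rng=rng)
candidate = np.array([[1.0, 0.0, 0.0],   # feature row for action 0
                      [0.0, 1.0, 0.0]])  # feature row for action 1
action = greedy_action(sampled_w, candidate)
```

Sampling the weights, rather than using the posterior mean, is what drives exploration in this family of methods: early on the posterior is wide and the sampled greedy action varies, and it concentrates as data accumulates.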

Abstract

We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collectively affect a user's daily commitment to being active. Our goal is to develop an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markov state transitions within and across bags. We then formulate this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman equations for stationary MDPs is generalized to handle periodic MDPs. We show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Finally, we evaluate the proposed method on testbed variants built from real data in a mobile health clinical trial.
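The periodic-MDP idea can be illustrated with a small tabular sketch. It assumes a finite state and action space, known transitions, a reward that is nonzero only at the last decision time of each bag, and a discount that equals 1 inside a bag and $\gamma$ at the bag boundary. These are simplifications chosen for illustration; the paper's algorithm learns online with function approximation, and all names below are hypothetical. Each position $k$ in the bag keeps its own Q-table, and the backup for position $k$ bootstraps from position $k + 1$, wrapping around to the first position of the next bag.

```python
import numpy as np

def periodic_value_iteration(P, r, gammas, n_sweeps=500, tol=1e-10):
    """Value iteration for a K-periodic MDP (illustrative sketch).

    P[k]      : (S, A, S) transition probabilities at within-bag position k
    r[k]      : (S, A) expected reward at position k
    gammas[k] : discount applied when moving from position k to the next
    Returns a list of K Q-tables, one per within-bag position.
    """
    K = len(P)
    n_states, n_actions = r[0].shape
    Q = [np.zeros((n_states, n_actions)) for _ in range(K)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in reversed(range(K)):
            nxt = (k + 1) % K                # last position wraps to the next bag
            v_next = Q[nxt].max(axis=1)      # greedy value at the next position
            new_q = r[k] + gammas[k] * (P[k] @ v_next)
            max_change = max(max_change, float(np.abs(new_q - Q[k]).max()))
            Q[k] = new_q
        if max_change < tol:
            break
    return Q

# Toy bag with K = 2 decision times, one state, two actions (made-up numbers).
# The bag reward is 1 only if action 1 is taken at the last decision time.
P = [np.ones((1, 2, 1)) for _ in range(2)]
r = [np.zeros((1, 2)), np.array([[0.0, 1.0]])]
gammas = [1.0, 0.9]                          # discount only across bags
Q = periodic_value_iteration(P, r, gammas)
```

In this toy problem the optimal value satisfies $V = 1 + 0.9 V$, so $V = 10$; the Q-table at the last position should therefore approach 9 for action 0 and 10 for action 1.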

Paper Structure

This paper contains 68 sections, 10 theorems, 133 equations, 25 figures, 3 tables, 6 algorithms.

Key Result

Lemma 4.1

Under Assumption asp:state.D.BaSS, the state transition is Markovian within and across bags, and we have $R_{d} \mathrel{\perp\!\!\!\perp} S_{t, l}, A_{t, l} | S_{d, k}, A_{d, k}$ for any $(t, l) < (d, k)$. Further, $S_{d, k + 1} \mathrel{\perp\!\!\!\perp} \widetilde{\bm{B}}_{d - 1} | S_{d, k}, A_{d

Figures (25)

  • Figure 1: Causal DAG for bag $d$ and $d + 1$. The arrows pointing to the actions $A_{d, 1:K}$ are omitted.
  • Figure 2: The average cumulative rewards of BRLSVI, SRLSVI, RLSVI, and RAND, each subtracting the average cumulative rewards of the zero policy.
  • Figure 3: Causal DAG when the DAG in Figure 1 is misspecified. The arrows pointing to the actions $A_{d, 1:K}$ are omitted.
  • Figure 4: The average cumulative rewards of BRLSVI learned with states $S^{\prime}$, $S^{\prime \prime}$, or $S^{\prime \prime \prime}$ from the paper's state-comparison equation, each subtracting the average cumulative rewards of the zero policy. The vertical dotted line represents the end of the warm-up period.
  • Figure 5: The number of days before, during, and after the study for each user. "Before study" is the period when the trial starts recording activities for a user, but the user has not yet started wearing the Fitbit. "Warm up" is the period during which a user wears the Fitbit for at least 8 hours per day for 7 days. The length of the warm-up period may extend beyond 8 days if the user wears the Fitbit for less than 8 hours on any given day within this timeframe. "In study" is the period during which the RL algorithm is active for a user, and the user has not yet dropped out. "Drop out" is the period during which the RL algorithm continues to run for a user, but the user no longer wears the Fitbit for more than 8 hours per day.
  • ...and 20 more figures

Theorems & Definitions (23)

  • Definition 3.1: Bagged Decision Times
  • Definition 4.1: D-BaSS
  • Lemma 4.1
  • Definition 4.2: $K$-Periodic MDP
  • Theorem 4.2: Bellman Optimality Equations
  • Lemma 4.3
  • Theorem 4.4
  • Corollary 4.5
  • Lemma B.1
  • Lemma B.2
  • ...and 13 more