Dyadic Reinforcement Learning

Shuangning Li; Lluis Salvat Niell; Sung Won Choi; Inbal Nahum-Shani; Guy Shani; Susan Murphy

TL;DR

The paper tackles personalization of mobile-health interventions within dyads by introducing Dyadic RL, a two-tier hierarchical reinforcement learning framework that handles actions at different time scales and noisy, non-Markovian dynamics. The low-level policy uses randomized least-squares value iteration to optimize within time blocks, while the high-level policy employs Thompson sampling to select weekly actions, with a novel reward construction that denoises the high-level signal. A rigorous regret bound is proved, showing a sublinear rate of $\tilde{O}(H^3 S^{3/2} A^{1/2} |S^{\operatorname{high}}|^{1/2} \sqrt{KW})$ under tabular assumptions, highlighting the benefit of hierarchical structure for learning efficiency. The authors validate Dyadic RL through toy simulations and a Roadmap 2.0–based simulation test bed, demonstrating robust performance against baselines and under varied delayed effects, with practical implications for implementing dyadic interventions in trials like ADAPTS HCT. The work advances interpretable, scalable dyadic interventions in mobile health by providing both theoretical guarantees and empirically grounded demonstration of real-world applicability.
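To make the two-tier idea concrete, the following is a minimal sketch, not the authors' algorithm. It pairs a tabular form of randomized least-squares value iteration (with one-hot features the Bayesian regression decouples into one Gaussian draw per state-action cell) with Gaussian Thompson sampling over weekly actions. Everything named here is an assumption made for illustration: the functions `rlsvi_tabular`, `thompson_pick`, `toy_reward`, and `run`, the toy reward probabilities, the unit prior and noise variances, and above all the high-level reward, which is simply the average low-level reward in the block rather than the denoised construction the paper describes.

```python
import math
import random


def rlsvi_tabular(data, n_states, n_actions, horizon, rng, sigma=1.0, lam=1.0):
    """Sample Q-tables by backward randomized value iteration.

    data[h] holds (s, a, r, s_next) transitions observed at period h.
    sigma (noise std) and lam (prior precision) are assumed tuning values.
    """
    # q[horizon] stays at zero: no value after the last period of a block.
    q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon + 1)]
    for h in reversed(range(horizon)):
        sums = [[0.0] * n_actions for _ in range(n_states)]
        counts = [[0] * n_actions for _ in range(n_states)]
        for s, a, r, s_next in data[h]:
            sums[s][a] += r + max(q[h + 1][s_next])  # regression target
            counts[s][a] += 1
        for s in range(n_states):
            for a in range(n_actions):
                precision = counts[s][a] / sigma**2 + lam
                mean = (sums[s][a] / sigma**2) / precision
                q[h][s][a] = rng.gauss(mean, math.sqrt(1.0 / precision))
    return q


def thompson_pick(stats, rng):
    """Gaussian Thompson sampling; stats[a] = [count, reward_sum]."""
    draws = [rng.gauss(total / (n + 1.0), math.sqrt(1.0 / (n + 1.0)))
             for n, total in stats]
    return max(range(len(stats)), key=draws.__getitem__)


def toy_reward(high_action, low_action, rng):
    """Hypothetical environment: daily action 1 helps more under weekly action 1."""
    p = 0.2 + 0.3 * low_action + 0.3 * high_action * low_action
    return 1.0 if rng.random() < p else 0.0


def run(n_episodes=30, n_weeks=4, horizon=7, seed=0):
    """Return the total reward collected in each weekly block."""
    rng = random.Random(seed)
    data = [[] for _ in range(horizon)]   # low-level transitions per period
    stats = [[0, 0.0], [0, 0.0]]          # high-level statistics per weekly action
    block_returns = []
    for _ in range(n_episodes):           # one episode per dyad
        for _ in range(n_weeks):          # one time block per week
            high = thompson_pick(stats, rng)
            q = rlsvi_tabular(data, 2, 2, horizon, rng)
            total = 0.0
            for h in range(horizon):
                # The low-level state is just the weekly action in this toy setup.
                low = max(range(2), key=q[h][high].__getitem__)
                r = toy_reward(high, low, rng)
                data[h].append((high, low, r, high))
                total += r
            stats[high][0] += 1
            stats[high][1] += total / horizon  # simplified high-level reward
            block_returns.append(total)
    return block_returns


if __name__ == "__main__":
    returns = run()
    print(sum(returns[:20]) / 20, sum(returns[-20:]) / 20)
```

The script prints the average block return over the first and last 20 weeks so the two can be compared. The comparison is noisy for a single seed, so averaging over many seeds gives a clearer picture.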

Abstract

Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily lives. The involvement of care partners and social support networks often proves crucial in helping individuals manage burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL, and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and a realistic test bed constructed from data collected in a mobile health study.

Paper Structure (33 sections, 1 theorem, 51 equations, 12 figures, 3 tables, 7 algorithms)

Introduction
Related work
Hierarchical Reinforcement Learning
Reinforcement Learning on Social Networks
Reinforcement Learning in Mobile Health Studies
Dyadic RL
Notation and setup
Approximations for the environment
Dyadic RL algorithm
A Regret Bound
Simulations
Environment
Details of the construction of the environment
Variants of the toy environments
Reward structures
...and 18 more sections

Key Result

Theorem 1

Assume the reward is bounded in $[0,1]$, the action spaces and state spaces are finite and discrete, and the feature mappings are one-hot encodings. Furthermore, assume that Approximations appr:episodic_mdp and appr:hier_structure hold. Define […] Then Algorithm alg:rlsvi_greedy_more_episodes, with the following choice of parameters: […] and […], has cumulative expected regret bounded by […] The notation $\tilde{O}$ …
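As a rough illustration of what the sublinear rate quoted in the TL;DR, $\tilde{O}(H^3 S^{3/2} A^{1/2} |S^{\operatorname{high}}|^{1/2} \sqrt{KW})$, implies, the sketch below evaluates only the leading term. It ignores constants and the logarithmic factors that $\tilde{O}$ conventionally hides. The function name `regret_rate`, the reading of each symbol given in the comments, and the example values are assumptions for illustration, not quantities taken from the theorem.

```python
import math


def regret_rate(H, S, A, S_high, K, W):
    """Leading-order term of the quoted rate (no constants, no log factors).

    Assumed reading: H periods per block, S low-level states, A actions,
    S_high high-level states, K episodes, W time blocks per episode.
    """
    return H**3 * S**1.5 * A**0.5 * S_high**0.5 * math.sqrt(K * W)


# Hypothetical problem sizes, chosen only to show the scaling in K.
base = regret_rate(H=7, S=4, A=2, S_high=3, K=100, W=14)
more = regret_rate(H=7, S=4, A=2, S_high=3, K=400, W=14)

print(more / base)                   # ratio of cumulative rates
print((more / 400) / (base / 100))   # ratio of per-episode rates
```

Because the rate depends on $K$ only through $\sqrt{KW}$, quadrupling the number of episodes roughly doubles the cumulative bound, so the bound per episode is roughly halved. That shrinking per-episode term is what "sublinear" means here.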

Figures (12)

Figure 1: Simplified illustration of the dyad. Daily actions are directed towards the target person, while weekly actions aim to enhance the relationship between the target person and their care partner.
Figure 2: A directed acyclic graph illustrating the relationship of the variables in episode $k$. Here, $s^{\operatorname{low}}_{k,w,h}$, $a^{\operatorname{low}}_{k,w,h}$ and $r^{\operatorname{low}}_{k,w,h}$ represent the low-level state, action and reward at time period $h$ of time block $w$ respectively. Meanwhile, $s^{\operatorname{high}}_{k,w}$ and $a^{\operatorname{high}}_{k,w}$ represent the high-level state and action in time block $w$.
Figure 3: Two different types of mazes: there is a drift to the right, and the action set is $\left\{\operatorname{up}, \operatorname{down}\right\}$.
Figure 4: Two different reward structures.
Figure 5: Toy environment 1 in Table \ref{table:toy_environments}: A simulation setting where the reward signal is denser. There is no delayed effect at the level of time blocks. The results are averaged over 10,000 independent experimental repetitions.
...and 7 more figures

Theorems & Definitions (3)

Theorem 1: Regret Bound
Definition 1: subMDP
Definition 2: Equivalent subMDPs
