RoME: A Robust Mixed-Effects Bandit Algorithm for Optimizing Mobile Health Interventions

Easton K. Huch, Jieru Shi, Madeline R. Abbott, Jessica R. Golbus, Alexander Moreno, Walter H. Dempsey

TL;DR

RoME introduces a robust mixed-effects contextual bandit for mobile health by combining debiased machine learning, a partially linear reward model, and network cohesion to handle nonlinear baselines and longitudinal heterogeneity. The method yields a high-probability regret bound that scales with the differential-reward dimension $d$ and remains robust to misspecification of the baseline model. Empirical results show RoME outperforming competing approaches in heterogeneous and nonlinear settings, with strong off-policy gains in the Valentine and Intern Health Study datasets and substantial computational efficiency. This framework enables scalable, personalized, context-aware interventions in mHealth while providing theoretical guarantees and practical performance improvements.

Abstract

Mobile health leverages personalized and contextually tailored interventions optimized through bandit and reinforcement learning algorithms. In practice, however, challenges such as participant heterogeneity, nonstationarity, and nonlinear relationships hinder algorithm performance. We propose RoME, a Robust Mixed-Effects contextual bandit algorithm that simultaneously addresses these challenges via (1) modeling the differential reward with user- and time-specific random effects, (2) network cohesion penalties, and (3) debiased machine learning for flexible estimation of baseline rewards. We establish a high-probability regret bound that depends solely on the dimension of the differential-reward model, enabling us to achieve robust regret bounds even when the baseline reward is highly complex. We demonstrate the superior performance of the RoME algorithm in a simulation and two off-policy evaluation studies.
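
As a rough sketch of the structure the abstract describes, a partially linear reward model with user- and time-specific random effects could be written as follows. The notation is an assumption made for illustration and is not taken from the paper: a binary action $A_{i,t}$, a feature map $x(\cdot)$ of dimension $d$, a shared effect $\theta$, a user effect $u_i$, a time effect $v_t$, and noise $\epsilon_{i,t}$. Only the baseline $g_t(S_{i,t})$ appears in the material summarized here.

$$
R_{i,t} = g_t(S_{i,t}) + A_{i,t}\, x(S_{i,t})^\top \theta_{i,t} + \epsilon_{i,t},
\qquad
\theta_{i,t} = \theta + u_i + v_t .
$$

Under this assumed form, the baseline $g_t$ may be arbitrarily nonlinear, while only the $d$-dimensional differential reward $x(S_{i,t})^\top \theta_{i,t}$ determines which action is preferable. This is consistent with a regret bound that depends on $d$ alone.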

Paper Structure (42 sections, 17 theorems, 111 equations, 13 figures, 8 tables, 2 algorithms)

Key Result

Theorem 1

Under Assumptions assm:bounded-policy through assmp:subgaussian and assmp:theta-structure through assumption:Vitunderbar-bound, with probability at least $1-\delta$, $\boldsymbol{\operatorname{Regret}}_K$ is of order

Figures (13)

  • Figure 1: Illustration of the staged recruitment scheme. At each recruitment stage (each time point), a new participant is recruited and observed. At the same time, all participants who were recruited prior to the current stage are also observed again. Observations are not collected from participants who have yet to be recruited. For simplicity, we assume one participant is recruited at each stage.
  • Figure 2: Cumulative regret in the (a) Homogeneous Users, (b) Heterogeneous Users, and (c) Nonlinear settings. RoME performs competitively in the first setting (the simplest), and it substantially outperforms the next-best method (IntelPooling) in the others.
  • Figure 3: (left) Boxplot of unbiased estimates of the average per-trial reward for all five competing algorithms, relative to the reward obtained under the pre-specified Valentine randomization policy across 100 bootstrap samples. Within each box, the asterisk ($\ast$) indicates the mean value, while the mid-bar represents the median. (right) Heatmap of p-values from the pairwise paired t-tests.
  • Figure 4: (left) The baseline reward function $g_t(S_{i,t})$ (constant across time in this case) used in the simulation study. The proposed method allows this function to be a nonlinear function of the context vectors. The baseline was generated using a combination of recursive partitioning and by summing scaled, shifted, and rotated Gaussian densities. (right) The time-specific parameters used in the simulation study. These parameters cause the advantage function to vary over time. We set them such that the advantage function changes quickly at the beginning of the study then stabilizes.
  • Figure 5: (left) The baseline reward function $g_t(S_{i,t})$ used in the simulation study compared to (right) the estimated baseline reward from our neural network in the nonlinear setting.
  • ...and 8 more figures
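
The staged recruitment scheme in Figure 1 can be illustrated with a short sketch. It assumes, as the caption does, that exactly one participant is recruited per stage; the function name `staged_recruitment_schedule` and the 1-based participant indices are hypothetical choices made for this example.

```python
def staged_recruitment_schedule(num_stages):
    """Map each stage k = 1..num_stages to the participants observed at that stage.

    Assumes one new participant per stage (the simplification stated in Figure 1).
    """
    schedule = {}
    for stage in range(1, num_stages + 1):
        # Participant `stage` is recruited now; everyone recruited earlier
        # is observed again. Later participants are not yet observed.
        schedule[stage] = list(range(1, stage + 1))
    return schedule


schedule = staged_recruitment_schedule(4)
# Total number of observations collected across all stages.
total_observations = sum(len(observed) for observed in schedule.values())
```

Under this scheme the number of observations after $K$ stages grows as $K(K+1)/2$, since stage $k$ contributes $k$ observations.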

Theorems & Definitions (39)

  • Theorem 1
  • Lemma 1: Unbiasedness of the IPS estimator
  • proof
  • Definition 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 29 more
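
Lemma 1 in the list above concerns the unbiasedness of the IPS estimator, which underlies off-policy comparisons such as the one in Figure 3. The following is a minimal sketch of a generic inverse propensity score estimate, not the paper's exact estimator. The function name `ips_value`, the argument layout, and the toy data are assumptions made for illustration.

```python
def ips_value(rewards, actions, behavior_probs, target_probs):
    """Inverse propensity score estimate of a target policy's average reward.

    rewards[n]        : observed reward for logged trial n
    actions[n]        : index of the action actually taken in trial n
    behavior_probs[n] : probability the logging policy gave to that action
    target_probs[n]   : list of target-policy probabilities, one per action
    """
    n = len(rewards)
    if n == 0 or not (n == len(actions) == len(behavior_probs) == len(target_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for r, a, b, t in zip(rewards, actions, behavior_probs, target_probs):
        if b <= 0:
            # The logging policy must give positive probability to every logged action.
            raise ValueError("behavior probabilities must be positive")
        # Reweight each reward by how much more (or less) likely the target
        # policy is to take the logged action than the logging policy was.
        total += (t[a] / b) * r
    return total / n


# Toy logged data: two actions, uniform logging policy.
rewards = [1.0, 0.0, 2.0, 1.0]
actions = [1, 0, 1, 0]
behavior_probs = [0.5, 0.5, 0.5, 0.5]

same_policy = ips_value(rewards, actions, behavior_probs, [[0.5, 0.5]] * 4)
always_one = ips_value(rewards, actions, behavior_probs, [[0.0, 1.0]] * 4)
```

When the target policy matches the logging policy, every weight equals one and the estimate reduces to the sample mean of the logged rewards.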