Context in Public Health for Underserved Communities: A Bayesian Approach to Online Restless Bandits

Biyonka Liang; Lily Xu; Aparna Taneja; Milind Tambe; Lucas Janson

TL;DR

This paper tackles efficient allocation of scarce interventions in public health by modeling beneficiary adherence as a contextual, non-stationary restless multi-armed bandit (RMAB) with $N$ arms, budget $B$, and horizon $T$. It introduces BCoR, a Bayesian contextual RMAB method that combines hierarchical Bayesian modeling with Thompson sampling to share information within and across arms, and that handles non-stationarity via spline-based time effects. Key contributions include a Bayesian learning framework for the transition probabilities $P_i^{(t)}(1\mid s,a)$ with within-arm and across-arm sharing, the use of a Whittle-index policy for online arm selection, and extensive empirical validation on both simulated settings and a scenario driven by real ARMMAN data. BCoR demonstrates substantial finite-sample gains over strong baselines (including a $61\%$ increase in engagement in the ARMMAN experiment with $B=10$), supporting its practical potential for deployment in large-scale mHealth programs.

Abstract

Public health programs often provide interventions to encourage program adherence, and effectively allocating interventions is vital for producing the greatest overall health outcomes, especially in underserved communities where resources are limited. Such resource allocation problems are often modeled as restless multi-armed bandits (RMABs) with unknown underlying transition dynamics, hence requiring online reinforcement learning (RL). We present Bayesian Learning for Contextual RMABs (BCoR), an online RL approach for RMABs that combines techniques in Bayesian modeling with Thompson sampling in a novel way to flexibly model the complex RMAB settings present in public health program adherence problems, namely context and non-stationarity. BCoR's key strength is the ability to leverage shared information within and between arms to learn the unknown RMAB transition dynamics quickly in intervention-scarce settings with relatively short time horizons, which are common in public health applications. Empirically, BCoR achieves substantially higher finite-sample performance over a range of experimental settings, including a setting using real-world adherence data that was developed in collaboration with ARMMAN, an NGO in India which runs a large-scale maternal mHealth program, showcasing BCoR's practical utility and potential for real-world deployment.
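
To make the Thompson-sampling idea in the abstract concrete, here is a minimal sketch under strong simplifying assumptions. It is not BCoR: each arm gets independent Beta posteriors over $P_i(1\mid s,a)$ (no hierarchical model, no covariates, no splines), and arms are ranked by a myopic sampled gain rather than a Whittle index. The names `select_arms` and `simulate`, and the ground-truth generator, are hypothetical and exist only for illustration.

```python
import random


def select_arms(counts, states, budget, rng):
    """Pick `budget` arms by Thompson sampling (illustrative stand-in, not BCoR).

    counts[i][(s, a)] = [successes, failures]: observed transitions of arm i
    from state s under action a into state 1 (success) or state 0 (failure).
    """
    scores = []
    for i, arm in enumerate(counts):
        s = states[i]
        # Sample P_i(1 | s, a) for both actions from Beta(1 + succ, 1 + fail).
        p_act = rng.betavariate(1 + arm[(s, 1)][0], 1 + arm[(s, 1)][1])
        p_pas = rng.betavariate(1 + arm[(s, 0)][0], 1 + arm[(s, 0)][1])
        # Assumption: a myopic gain replaces the Whittle index for brevity.
        scores.append((p_act - p_pas, i))
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:budget])


def simulate(n_arms=20, budget=2, horizon=30, seed=0):
    """Run the sketch on a hypothetical stationary two-state RMAB."""
    rng = random.Random(seed)
    truth = []
    for _ in range(n_arms):
        base = {s: rng.uniform(0.1, 0.6) for s in (0, 1)}
        lift = rng.uniform(0.0, 0.3)  # acting never hurts in this toy model
        truth.append({(s, a): base[s] + a * lift for s in (0, 1) for a in (0, 1)})
    counts = [{(s, a): [0, 0] for s in (0, 1) for a in (0, 1)} for _ in range(n_arms)]
    states = [rng.randint(0, 1) for _ in range(n_arms)]
    total = 0
    for _ in range(horizon):
        chosen = set(select_arms(counts, states, budget, rng))
        for i in range(n_arms):
            a = 1 if i in chosen else 0
            s = states[i]
            nxt = 1 if rng.random() < truth[i][(s, a)] else 0
            counts[i][(s, a)][1 - nxt] += 1  # slot 0 counts successes, slot 1 failures
            states[i] = nxt
            total += nxt  # reward is 1 whenever an arm is in the engaged state
    return total / (horizon * n_arms)  # time- and arm-averaged reward
```

Calling `simulate(n_arms=400, budget=10, horizon=50)` mirrors the scale of the simulated experiments, where the budget is $2.5\%$ of the arms; with so few pulls per arm, independent posteriors learn slowly, which is the gap that sharing information within and across arms is meant to close.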

Paper Structure (35 sections, 12 equations, 18 figures, 1 algorithm)

Introduction
Main Contributions
Related Work
Online RL for RMABs
Incorporating Contextual Information
Allowing for Non-Stationarity in RMABs
Other Related Learning Approaches
Problem Setting
The BCoR Algorithm
Learning the Transition Dynamics
Sharing Information Within an Arm
Sharing Information Across Arms
Addressing Non-Stationarity
Online Arm Selection
Experiments
...and 20 more sections

Figures (18)

Figure 1: We generate all RMAB instances using $N=400$, $T=50$, and $B=10$, i.e., $B$ is $2.5\%$ of $N$, across $1{,}000$ random seeds. The covariate matrix $\bm{X}$ is randomly generated with $k=4$ (two continuous covariates and two categorical) across the random seeds. The various RMAB simulation settings are detailed in Section \ref{sec:simulation_exp} and can be summarized as (a) a well-specified setting (no components of Model \ref{final_model} are zeroed out), (b) a setting where passive actions are uninformative of active actions ($b_0=b_1=0$), (c) a stationary setting ($\bm{\eta}^{(s,a)} = \bm{0}, \forall s,a$), (d) a setting with uninformative covariate information ($\bm{\mu}_{\bm{\beta}}=0$, $\bm{\beta}^{(s,a)} =0, \forall s,a$), and (e) a highly misspecified setting, i.e., one where the RMAB instances are stationary with no information sharing between or within the arms. Lines represent the time-averaged reward of each method averaged over the $1{,}000$ random seeds with the Random baseline subtracted out. Error bars depict $\pm 2$ SEs.
Figure 2: Performance of various methods on the ARMMAN data-driven example described in Section \ref{sec:real_data_exp} with $N=500$, $T=40$, and varying budget $B$, where every $B$ is at most $5\%$ of $N$ to reflect the magnitude of real-world budget constraints. Lines represent the time-averaged reward of each method averaged over $100$ random seeds with the Random baseline subtracted out. Note that the grey line is an oracle approach with access to the true transitions. Error bars depict $\pm 2$ SEs. UCWhittle performs worse than random across all settings, which can sometimes occur when the budget is relatively small and the time horizon is short, though it recovers over a longer time horizon; see Figure \ref{fig:ucwhittle_worse}.
Figure 3: Implied priors on transition probabilities using (a) a wide prior on the model parameters, $b_0 \sim \mathcal{N}\left(0, 2^2 \right), b_1 \sim \mathcal{N}\left(0, 2^2 \right), \bm{\mu}_{\bm{\beta}} \sim \mathcal{N}\left(\bm{0}_k, 2^2 I_{k \times k}\right), \tau^2_{\alpha^{(s, a)}} \sim \text{Inv-Gamma}(100, 1), \bm{\beta}^{(s, a)} \sim \mathcal{N}\left(\bm{\mu}_{\bm{\beta}}, 2^2 I_{k \times k}\right),$ and $\bm{\eta}^{(s, a)} \sim \mathcal{N}\left(\bm{0}_d, 2^2 I_{d \times d}\right)$, and (b) the prior specified in Model \ref{prior}. Hence, in (a), the prior variances are set much wider than what was used for the experimental results in this paper (represented by (b)). Histograms show transition probabilities for RMAB instances with $N=400$, $T=50$, $B=10$ generated using Model \ref{final_model} across $50$ random seeds. The covariate matrix $\bm{X}$ and the spline matrix $\bm{B}$ are generated as described in Section \ref{sec:appendix_sim_only}. Note that when using a wide prior, the transition probabilities tend to concentrate around $0$ and $1$, which is not representative of most realistic examples.
Figure 4: We generate RMAB instances using $N=400$, $T=50$, and $B=10$, i.e., $B$ is $2.5\%$ of $N$, across $1{,}000$ random seeds. The covariate matrix $\bm{X}$ was randomly generated with $k=4$ (two continuous covariates and two that are categorical) across the random seeds. The various RMAB simulation settings are detailed in Section \ref{sec:simulation_exp} and can be summarized as (a) a well-specified setting (no components of Model \ref{final_model} are zeroed out), (b) a setting where passive actions are uninformative of active actions ($b_0=b_1=0$), (c) a stationary setting ($\bm{\eta}^{(s,a)} = \bm{0}, \forall s,a$), (d) a setting with uninformative covariate information ($\bm{\mu}_{\bm{\beta}}=0$, all $\bm{\beta}^{(s,a)} =0, \forall s,a$), and (e) a highly misspecified setting, i.e., one where the RMAB instances are stationary with no information sharing between or within the arms. Lines represent the time-averaged reward of each method averaged over $1{,}000$ independent instances with the Random baseline subtracted out. Error bars depict $\pm 2$ SEs.
Figure 5: Changing the number of covariates: We generate RMAB instances using $N=400$, $T=50$, and $B=10$, i.e., $B$ is $2.5\%$ of $N$, across $1{,}000$ random seeds. The covariate matrix $\bm{X}$ is randomly generated with $k=8$ (five continuous covariates and three categorical generated as described in Section \ref{sec:appendix_sim_only}, adding another Bern$(0.5)$ covariate and three additional $\mathcal{N}(0, 1)$ distributed continuous covariates) across the random seeds. The various RMAB simulation settings are detailed in Section \ref{sec:simulation_exp} and can be summarized as (a) a well-specified setting (no components of Model \ref{final_model} are zeroed out), (b) a setting where passive actions are uninformative of active actions ($b_0=b_1=0$), (c) a stationary setting ($\bm{\eta}^{(s,a)} = \bm{0}, \forall s,a$), (d) a setting with uninformative covariate information ($\bm{\mu}_{\bm{\beta}}=0$, all $\bm{\beta}^{(s,a)} =0, \forall s,a$), and (e) a highly misspecified setting, i.e., one where the RMAB instances are stationary with no information sharing between or within the arms. Lines represent the time-averaged reward of each method averaged over the $1{,}000$ random seeds with the Random baseline subtracted out. Error bars depict $\pm 2$ SEs.
...and 13 more figures
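
The captions above report time-averaged reward with the Random baseline subtracted out and error bars of $\pm 2$ SEs. A rough sketch of how such a summary could be computed is given below. It assumes one time-averaged reward per seed for each method and treats the standard error as that of the paired per-seed differences; the captions do not say whether the authors pair by seed, and the function name `baseline_adjusted_summary` is hypothetical.

```python
import math
import statistics


def baseline_adjusted_summary(method_rewards, random_rewards):
    """Return (mean difference, 2-SE half-width) of method minus Random.

    Both inputs hold one time-averaged reward per random seed, in the same
    seed order. Assumption: differences are paired by seed.
    """
    if len(method_rewards) != len(random_rewards):
        raise ValueError("need one reward per seed for both policies")
    if len(method_rewards) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    diffs = [m - r for m, r in zip(method_rewards, random_rewards)]
    mean = statistics.fmean(diffs)
    # Sample standard deviation divided by sqrt(number of seeds).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, 2.0 * se
```

For example, `baseline_adjusted_summary([0.5, 0.6, 0.7], [0.4, 0.4, 0.4])` gives a mean difference of about $0.2$, and the plotted interval would be that mean plus or minus the returned half-width.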

Theorems & Definitions (2)

Definition 4.1: The BCoR Learning Model
Definition 2.1: Whittle index
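
For orientation, the Whittle index is usually stated along the following lines. This is the standard textbook form and may differ in notation from the paper's Definition 2.1; the reward function $r$, discount factor $\gamma$, and subsidy $\lambda$ are symbols assumed here and do not appear elsewhere on this page.

```latex
% Subsidy-augmented Q-function for arm i: the passive action (a = 0) earns an extra \lambda.
Q_i^{\lambda}(s, a) = r(s) + \lambda\,\mathbb{1}[a = 0]
    + \gamma \sum_{s'} P_i(s' \mid s, a)\, V_i^{\lambda}(s'),
\qquad
V_i^{\lambda}(s') = \max_{a' \in \{0, 1\}} Q_i^{\lambda}(s', a').

% Whittle index: the smallest subsidy at which staying passive is as good as acting.
W_i(s) = \inf \left\{ \lambda : Q_i^{\lambda}(s, 0) \ge Q_i^{\lambda}(s, 1) \right\}.
```

A Whittle-index policy then acts on the $B$ arms whose current states have the largest indices $W_i(s)$.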
