Oralytics Reinforcement Learning Algorithm

Anna L. Trella; Kelly W. Zhang; Stephanie M. Carpenter; David Elashoff; Zara M. Greer; Inbal Nahum-Shani; Dennis Ruenger; Vivek Shetty; Susan A. Murphy

TL;DR

This work presents Oralytics, an online Bayesian contextual bandit designed to personalize engagement prompts to improve oral self-care behaviors. It combines a Bayesian linear regression reward model with action centering, a fully pooled (cross-participant) learning approach, and a carefully constructed prior from pilot data, all learned and updated weekly in a clinical trial setting. The authors address practical challenges such as app-opening issues through a modified RL pipeline, simulated environments based on ROBAS data, and a monitoring system to ensure data integrity. Through extensive simulation experiments across stationary/non-stationary environments and varying participant responsivity, they determine final design decisions (full pooling, weekly updates, a specific smoothing slope, and tuned cost terms) and demonstrate how the surrogate rewards incorporating delayed effects guide learning while mitigating potential over-prompting burdens. The work advances scalable, data-efficient personalization for digital health interventions with methods that support robust after-study causal inference and real-world deployment.

Abstract

Dental disease is still one of the most common chronic diseases in the United States. While dental disease is preventable through healthy oral self-care behaviors (OSCB), these basic behaviors are not consistently practiced. We have developed Oralytics, an online reinforcement learning (RL) algorithm that optimizes the delivery of personalized intervention prompts to improve OSCB. In this paper, we offer a full overview of algorithm design decisions made using prior data, domain expertise, and experiments in a simulation test bed. The finalized RL algorithm was deployed in the Oralytics clinical trial, conducted from fall 2023 to summer 2024.

Paper Structure (71 sections, 30 equations, 6 figures, 10 tables)

Introduction
Preliminaries
Available Data
Code
Algorithm Design Decisions
Overview
Fixed Decisions:
Decisions made using experiments in the simulation testbed:
Participant Onboarding Procedure and Prior Sampling Period
Proximal Outcome
Duration of Proximal Outcome Window
RL Framework
1. Choice of using a Contextual Bandit Algorithm Framework:
2. Choice of a Bayesian Framework:
Reward Approximating Function
...and 56 more sections

Figures (6)

Figure 1: Parameters of Pilot Data Fit To The Action-Centering Model.
Figure 2: Standard Effect Sizes From The Action-Centering Model.
Figure 3: Generalized logistic function with $L_{\min}=0.2$ (lower clipping), $L_{\max}=0.8$ (upper clipping), $c = 5$ (shift to right), $b = 20$. We show the function with $b = 20$ instead of the chosen $b=\frac{20}{\sigma_{\text{rvv}}} = 0.515$ to help behavioral scientists interpret the target probability of sending an engagement prompt given the treatment effect standardized by the residual standard deviation.
Figure 4: V2 Heatmap of Candidate Values for $\xi_1, \xi_2$. We evaluate candidate values $\xi_1$, the cost of sending engagement prompts for a high-performing brusher, and $\xi_2$, the cost of sending an engagement prompt regardless of participant performance (see Equation \ref{cost_term}). We consider two metrics across twelve simulation environment variants: stationary vs. non-stationary base model environment, two effect size scales (small and smaller), and effect size shrinkage $E \in \{0, 0.5, 0.8\}$. Small values of $E$ represent low participant robustness to habituation, where $E=0$ represents the most severe susceptibility to habituation. The blue grids show simulations evaluated using the average of $\sum_{t=1}^{T} Q_{i, t}$ across participants, and the purple grids show simulations evaluated using the $25$th percentile of $\sum_{t=1}^{T} Q_{i, t}$ across participants. The grid with the highest criterion value is boxed for readability.
Figure 5: Histogram of OSCB in ROBAS 3. OSCB across all 31 participants in ROBAS 3 across 140 brushing windows (2 brushing windows per day for 70 days). Since the ROBAS 3 study lasted for 90 days, but each participant for Oralytics will only be in the study for 70 days, we only take the first 70 days of data for each participant in ROBAS 3. Notice that the ROBAS 3 data set is highly zero-inflated.
...and 1 more figures
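
As a rough illustration of the curve described in the Figure 3 caption, the sketch below assumes the generalized logistic function takes the form $L_{\min} + (L_{\max} - L_{\min}) / (1 + c\,e^{-bx})$, where $x$ is the standardized treatment effect. That functional form is not stated in this listing and is an assumption here, as are the function name and the overflow guard; only the parameter values come from the caption.

```python
import math


def smoothed_prompt_probability(x, l_min=0.2, l_max=0.8, c=5.0, b=20.0):
    """Hypothetical sketch of a generalized logistic curve.

    ASSUMPTION: the form l_min + (l_max - l_min) / (1 + c * exp(-b * x))
    is not given in the caption; only l_min, l_max, c, and b are.
    """
    # Cap the exponent so math.exp cannot overflow for very negative x.
    exponent = min(-b * x, 700.0)
    return l_min + (l_max - l_min) / (1.0 + c * math.exp(exponent))


if __name__ == "__main__":
    # Print the curve at a few standardized treatment-effect values.
    for x in (-0.5, 0.0, 0.1, 0.5):
        print(x, round(smoothed_prompt_probability(x), 4))
```

Under this assumed form, the output stays strictly between the clipping values $L_{\min}$ and $L_{\max}$, and $c > 1$ pushes the value at $x = 0$ below the midpoint, which matches the caption's description of $c$ as a shift to the right.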
