Power Constrained Nonstationary Bandits with Habituation and Recovery Dynamics

Fengxu Li, Stephanie M. Carpenter, Matthew P. Buman, Yonatan Mintz

TL;DR

This work addresses learning in nonstationary bandits whose rewards evolve through habituation and recovery. Building on the ROGUE framework, it introduces a Thompson Sampling policy (ROGUE-TS) that achieves sublinear regret under these dynamics. To support population-level causal inference in micro-randomized trials (MRTs), the authors add a power-constrained clipping mechanism that enforces a minimum exploration probability, yielding near-optimal regret while preserving statistical power for hypothesis testing across multiple arms. They establish regret guarantees for ROGUE-TS and its clipped variant, and validate both on real MRT datasets for physical activity and bipolar disorder, showing improved regret with sustained statistical power. The results offer practical guidance for designing MRTs that balance personalized intervention delivery with rigorous population-level inference, including how prior data and adaptive exploration can be leveraged to manage safety and burden.

Abstract

A common challenge for decision makers is selecting actions whose rewards are unknown and evolve over time based on prior policies. For instance, repeated use may reduce an action's effectiveness (habituation), while inactivity may restore it (recovery). These nonstationarities are captured by the Reducing or Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world settings such as behavioral health interventions. While existing algorithms can compute sublinear regret policies for these settings, they may overemphasize exploitation and explore too little, which limits the ability to estimate population-level effects. This challenge is of particular interest in micro-randomized trials (MRTs), which help researchers develop just-in-time adaptive interventions that have population-level effects while still providing personalized recommendations to individuals. In this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored to the ROGUE framework, and provide theoretical guarantees of sublinear regret. We then introduce a probability clipping procedure to balance personalization and population-level learning, and quantify the trade-off between regret and the minimum exploration probability. Validation on two MRT datasets concerning physical activity promotion and bipolar disorder treatment shows that our methods achieve lower regret than existing approaches and, through the clipping procedure, maintain high statistical power without significantly increasing regret. This enables reliable detection of treatment effects while accounting for individual behavioral dynamics. For researchers designing MRTs, our framework offers practical guidance on balancing personalization with statistical validity.
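The idea of enforcing a minimum exploration probability can be illustrated with a small sketch. It is not the paper's procedure: the clipping rule itself is not given in this section, so the code below assumes a simple uniform-mixture floor, and it uses a plain Gaussian posterior per arm with no habituation or recovery dynamics. The names `thompson_probabilities`, `floor_probabilities`, and `p_min` are hypothetical.

```python
import numpy as np


def thompson_probabilities(post_mean, post_std, n_draws=10_000, seed=0):
    """Monte Carlo estimate of how often Thompson Sampling would pick each arm.

    Assumes independent Gaussian posteriors per arm (an illustrative
    simplification, not the ROGUE model).
    """
    rng = np.random.default_rng(seed)
    k = len(post_mean)
    # One posterior draw per arm per simulated round; the arm with the
    # largest draw is the one Thompson Sampling would select.
    draws = rng.normal(post_mean, post_std, size=(n_draws, k))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=k) / n_draws


def floor_probabilities(probs, p_min):
    """Give every arm at least p_min by mixing with the uniform distribution.

    Assumed rule: p' = p_min + (1 - k * p_min) * p, which sums to 1
    whenever p sums to 1 and 0 <= p_min <= 1/k.
    """
    probs = np.asarray(probs, dtype=float)
    k = probs.size
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("probs must sum to 1")
    if not 0.0 <= p_min <= 1.0 / k:
        raise ValueError("p_min must lie in [0, 1/k]")
    return p_min + (1.0 - k * p_min) * probs


# Hypothetical three-arm example: arm 0 clearly looks best a posteriori.
ts_probs = thompson_probabilities([1.0, 0.2, 0.1], [0.1, 0.1, 0.1])
pi = floor_probabilities(ts_probs, p_min=0.1)
arm = np.random.default_rng(1).choice(len(pi), p=pi)
```

Without the floor, the two weaker arms would almost never be chosen, so little data would accumulate for testing their effects. A larger `p_min` trades additional regret for more randomization, which is the trade-off the abstract says the paper quantifies.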

Paper Structure

This paper contains 32 sections, 15 theorems, 84 equations, 9 figures, 3 algorithms.

Key Result

Theorem 1

For any $T\in \mathbb{N}$: $R_\Pi(T) \leq \frac{8}{3}L_g^2 |\mathcal{A}|^{\frac{1}{4}}T^{\frac{3}{4}}\sqrt{ 2\sigma c_f(d_x,d_\theta) + 4L_p\sigma^2\sqrt{\log(T)}} +2C_g\min\{T,|\mathcal{A}|\}.$
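To see why this bound implies sublinear regret, one can divide both sides by $T$. The step below is a reading of the stated inequality, assuming that $L_g$, $C_g$, $L_p$, $\sigma$, $c_f(d_x,d_\theta)$, and $|\mathcal{A}|$ do not depend on $T$:

$$
\frac{R_\Pi(T)}{T} \leq \frac{8}{3}L_g^2 |\mathcal{A}|^{\frac{1}{4}}T^{-\frac{1}{4}}\sqrt{ 2\sigma c_f(d_x,d_\theta) + 4L_p\sigma^2\sqrt{\log(T)}} + \frac{2C_g\min\{T,|\mathcal{A}|\}}{T} = O\!\left(T^{-\frac{1}{4}}\log(T)^{\frac{1}{4}}\right).
$$

The square root grows only like $\log(T)^{1/4}$ and the last term is at most $2C_g|\mathcal{A}|/T$, so under that assumption the average regret per round tends to zero as $T \to \infty$.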

Figures (9)

  • Figure 1: Average reward for Gaussian ROGUE simulation
  • Figure 2: Cumulative regret for Gaussian ROGUE simulation
  • Figure 3: Average reward for MRTs simulation
  • Figure 4: Cumulative regret for MRTs simulation
  • Figure 5: Type I Error on ROGUE-GLM
  • ...and 4 more figures

Theorems & Definitions (18)

  • Theorem 1
  • Proposition 1
  • Definition 1
  • Theorem 2: Concentration Bound, Corollary 1 in mintz2020nonstationary
  • Proposition 2
  • Lemma 1
  • Proposition 3
  • Lemma 2
  • Theorem 3
  • Proposition 4
  • ...and 8 more