Non-Stationary Latent Auto-Regressive Bandits

Anna L. Trella, Walter Dempsey, Asim H. Gazi, Ziping Xu, Finale Doshi-Velez, Susan A. Murphy

TL;DR

This work addresses non-stationary rewards in multi-armed bandits by modeling the mean rewards as driven by a latent autoregressive state $z_t$. It reduces the problem to a linear dynamical system and solves it online as a linear contextual bandit via Latent AR LinUCB (LARL), which approximates a steady-state Kalman filter without requiring offline parameter learning. The authors derive an interpretable regret bound against the dynamic oracle, showing sub-linear regret when the latent-state noise variance is sufficiently small relative to the horizon $T$, and demonstrate empirically that LARL outperforms stationary and non-stationary baselines across varying AR orders $k$. The approach handles smooth, latent non-stationarity in online decision-making without requiring a budget on the non-stationarity, with potential impact in domains like digital health where restless, evolving contexts are common.

Abstract

For the non-stationary multi-armed bandit (MAB) problem, many existing methods allow a general mechanism for the non-stationarity, but rely on a budget for the non-stationarity that is sub-linear in the total number of time steps $T$. In many real-world settings, however, the mechanism for the non-stationarity can be modeled, but there is no budget for the non-stationarity. We instead consider the non-stationary bandit problem where the reward means change due to a latent, auto-regressive (AR) state. We develop Latent AR LinUCB (LARL), an online linear contextual bandit algorithm that does not rely on the non-stationary budget, but instead forms good predictions of reward means by implicitly predicting the latent state. The key idea is to reduce the problem to a linear dynamical system which can be solved as a linear contextual bandit. In fact, LARL approximates a steady-state Kalman filter and efficiently learns system parameters online. We provide an interpretable regret bound for LARL with respect to the level of non-stationarity in the environment. LARL achieves sub-linear regret in this setting if the noise variance of the latent state process is sufficiently small with respect to $T$. Empirically, LARL outperforms various baseline methods in this non-stationary bandit problem.
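
To make the setting concrete, the following is a minimal sketch of an environment in which reward means are driven by a latent AR($k$) state. It is an illustration under stated assumptions, not the authors' simulation code: the function name `simulate_latent_ar_bandit`, the parameter names (`gammas`, `beta0`, `beta1`, `sigma_z`, `sigma_r`), and the specific linear form of the reward mean are hypothetical choices made for this example.

```python
import numpy as np

def simulate_latent_ar_bandit(T, gammas, beta0, beta1,
                              sigma_z=0.1, sigma_r=0.1, seed=0):
    """Simulate reward means driven by a latent AR(k) state.

    Assumed (hypothetical) model, for illustration only:
      z_t = gammas[0] + sum_{j=1..k} gammas[j] * z_{t-j} + N(0, sigma_z^2)
      mean reward of arm a at time t = beta0[a] + beta1[a] * z_t
    """
    rng = np.random.default_rng(seed)
    gammas = np.asarray(gammas, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    beta1 = np.asarray(beta1, dtype=float)
    k = len(gammas) - 1

    # The first k entries hold the initial latent states (set to zero here).
    z = np.zeros(T + k)
    for t in range(k, T + k):
        lags = z[t - k:t][::-1]  # z_{t-1}, ..., z_{t-k}
        z[t] = gammas[0] + gammas[1:] @ lags + rng.normal(0.0, sigma_z)
    z = z[k:]

    means = beta0[None, :] + np.outer(z, beta1)  # shape (T, num_arms)
    rewards = means + rng.normal(0.0, sigma_r, size=means.shape)
    return z, means, rewards

# Two arms whose means move in opposite directions with the latent state.
z, means, rewards = simulate_latent_ar_bandit(
    T=200, gammas=[0.0, 0.9], beta0=[0.0, 0.0], beta1=[1.0, -1.0])

# The arm a dynamic oracle would pick at each step changes with the sign of z_t.
best_arm = means.argmax(axis=1)
```

In this toy configuration the best arm flips whenever the latent state changes sign, which is the kind of unbudgeted non-stationarity the abstract describes: the learner never observes `z`, only the reward of the arm it pulled.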

Paper Structure

This paper contains 31 sections, 14 theorems, 112 equations, 5 figures, 1 algorithm.

Key Result

Lemma 3.2

(Linear Dynamical System) The latent state process (Equation latent_state) and the reward function (Equation linear_reward) in Definition def_non_stat_latent_auto_bandit form a special case of a linear dynamical system with Gaussian noise. The system has state vector $\vec{z}_t \in \mathbb{R}^{k}$; see Appendix linear_dynamic_system_proof for the exact forms of $\Gamma, W, C, c_a$.
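
As a rough illustration of why an AR($k$) recursion fits into a linear dynamical system, the sketch below builds the standard companion-form transition matrix for a stacked state $[z_t, \ldots, z_{t-k+1}]$. This is the textbook construction and is only assumed to resemble the paper's $\Gamma$; the exact forms of $\Gamma, W, C, c_a$ are given in the paper's appendix and may differ (for example in how the intercept is handled). The helper name `companion_form` and the coefficient layout `gammas = [intercept, gamma_1, ..., gamma_k]` are hypothetical.

```python
import numpy as np

def companion_form(gammas):
    """Return (Gamma, offset) for an AR(k) recursion in stacked-state form.

    Assumed layout (hypothetical): gammas = [intercept, gamma_1, ..., gamma_k].
    With state [z_{t-1}, ..., z_{t-k}], the noise-free update is
      next_state = Gamma @ state + offset.
    """
    gammas = np.asarray(gammas, dtype=float)
    k = len(gammas) - 1
    Gamma = np.zeros((k, k))
    Gamma[0, :] = gammas[1:]         # first row applies the AR coefficients
    Gamma[1:, :-1] = np.eye(k - 1)   # remaining rows shift older states down
    offset = np.zeros(k)
    offset[0] = gammas[0]            # intercept enters the newest state only
    return Gamma, offset

gammas = [0.1, 0.5, -0.2, 0.1]       # AR(3) with intercept 0.1
Gamma, offset = companion_form(gammas)

state = np.array([1.0, 2.0, 3.0])    # [z_{t-1}, z_{t-2}, z_{t-3}]
next_state = Gamma @ state + offset  # noise-free one-step update

# The first entry matches the scalar AR recursion; the rest are shifted copies.
scalar_next = gammas[0] + 0.5 * 1.0 - 0.2 * 2.0 + 0.1 * 3.0
```

Process noise would be added to the first entry of `next_state` only, which is why the stacked system stays linear with Gaussian noise.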

Figures (5)

  • Figure 1: Graphical Model for Non-Stationary Latent Auto-regressive Bandits.
  • Figure 2: Our algorithm LARL (blue), with $s$ chosen using BIC after a period of pure exploration, consistently achieves lower cumulative regret (Equation eqn_regret) over time against various baseline methods. Line is the average and shaded region is $\pm$ standard deviation across 100 Monte Carlo simulated trials.
  • Figure 3: Pairwise comparisons between algorithms in the three variants of the simulation environment where $k = 1, 5, 10$, respectively. Each cell shows the proportion of 100 Monte-Carlo repetitions where the algorithm listed in the row achieved lower cumulative regret than the algorithm listed in the column. Our algorithm LARL (top row) consistently outperforms baseline methods in pairwise comparison.
  • Figure 4: Pairwise comparisons between algorithms in the environment variants where $k = 1, 5, 10$, respectively. Each cell shows the proportion of 100 Monte-Carlo repetitions where the algorithm listed in the row achieved lower cumulative regret than the algorithm listed in the column. Even when $s$ is not specifically tuned, our algorithm still outperforms Stationary.
  • Figure 5: Cumulative regret (Equation eqn_regret) over time with varying choices of $s$ for our algorithm Latent AR LinUCB (Algorithm alg_latent_ar_ucb). For a poor choice of $s$ (either too small or too large compared to $k$), however, our algorithm performs similarly to the Stationary baseline. If $s$ is too small, the reward model is under-parameterized. If $s$ is too large, the reward model is over-parameterized. Line is the average and shaded region is $\pm$ standard deviation across Monte Carlo simulated trials.
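
The captions above mention choosing $s$ with BIC and the trade-off between under- and over-parameterization. The sketch below shows one generic way to score candidate lag orders with BIC using least-squares autoregressive fits on a single observed series. It is a simplified stand-in, not the paper's procedure: the function `select_lag_by_bic`, the use of a single series rather than bandit feedback, and the Gaussian-error form of BIC are all assumptions made for illustration.

```python
import numpy as np

def select_lag_by_bic(y, max_s):
    """Score lag orders s = 1..max_s by BIC of a least-squares AR(s) fit.

    Hypothetical helper: all fits share the same targets y[max_s:], so the
    scores are comparable across s. BIC here is n*log(RSS/n) + p*log(n),
    with p = s + 1 fitted coefficients (intercept plus s lags).
    """
    y = np.asarray(y, dtype=float)
    target = y[max_s:]
    n = len(target)
    scores = {}
    for s in range(1, max_s + 1):
        # Column j holds the lag-j value y[t - j] for each target y[t].
        lags = [y[max_s - j:len(y) - j] for j in range(1, s + 1)]
        X = np.column_stack([np.ones(n)] + lags)
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        rss = float(np.sum((target - X @ coef) ** 2))
        scores[s] = n * np.log(rss / n) + (s + 1) * np.log(n)
    best_s = min(scores, key=scores.get)
    return best_s, scores

# Toy AR(2) series: z_t = 0.5 z_{t-1} - 0.5 z_{t-2} + noise.
rng = np.random.default_rng(0)
series = np.zeros(2000)
for t in range(2, len(series)):
    series[t] = 0.5 * series[t - 1] - 0.5 * series[t - 2] + rng.normal()

best_s, scores = select_lag_by_bic(series, max_s=6)
```

The `(s + 1) * np.log(n)` penalty discourages overly large $s$, while the residual term penalizes an $s$ too small to capture the dynamics, mirroring the trade-off described for Figure 5.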

Theorems & Definitions (26)

  • Definition 3.1
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 4.1
  • proof
  • Theorem 4.2
  • proof
  • proof
  • ...and 16 more