Non-Stationary Latent Auto-Regressive Bandits
Anna L. Trella, Walter Dempsey, Asim H. Gazi, Ziping Xu, Finale Doshi-Velez, Susan A. Murphy
TL;DR
This work addresses non-stationary rewards in multi-armed bandits by modeling the mean rewards as driven by a latent autoregressive state $z_t$. It reduces the problem to a linear dynamical system and solves it online as a linear contextual bandit with Latent AR LinUCB (LARL), which approximates a steady-state Kalman filter without requiring offline parameter learning. The authors derive an interpretable regret bound against the dynamic oracle, showing sub-linear regret when the latent-state noise variance is sufficiently small relative to the horizon $T$, and demonstrate empirically that LARL outperforms stationary and non-stationary baselines across varying AR orders $k$. The approach handles smooth, latent non-stationarity in online decision-making without requiring a budget on the non-stationarity, with potential impact in domains such as digital health, where restless, evolving contexts are common.
Abstract
For the non-stationary multi-armed bandit (MAB) problem, many existing methods allow a general mechanism for the non-stationarity, but rely on a budget for the non-stationarity that is sub-linear in the total number of time steps $T$. In many real-world settings, however, the mechanism for the non-stationarity can be modeled, but there is no budget for the non-stationarity. We instead consider the non-stationary bandit problem where the reward means change due to a latent, auto-regressive (AR) state. We develop Latent AR LinUCB (LARL), an online linear contextual bandit algorithm that does not rely on a non-stationarity budget, but instead forms good predictions of reward means by implicitly predicting the latent state. The key idea is to reduce the problem to a linear dynamical system, which can be solved as a linear contextual bandit. In fact, LARL approximates a steady-state Kalman filter and efficiently learns system parameters online. We provide an interpretable regret bound for LARL with respect to the level of non-stationarity in the environment. LARL achieves sub-linear regret in this setting if the noise variance of the latent state process is sufficiently small with respect to $T$. Empirically, LARL outperforms various baseline methods in this non-stationary bandit problem.
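To make the setting concrete, the following is a minimal simulation sketch of a bandit whose reward means are driven by a latent AR($k$) state. It is not the authors' environment or the LARL algorithm. The functional form (a per-arm intercept plus a per-arm loading on the latent state), the AR coefficients, the noise scales, and the name `simulate_latent_ar_bandit` are all assumptions chosen for illustration. The sketch only compares the cumulative mean reward of the dynamic oracle (best arm at every step) with that of the best single fixed arm.

```python
import numpy as np


def simulate_latent_ar_bandit(T=500, k=2, n_arms=3, sigma_z=0.1, sigma_r=0.1, seed=0):
    """Simulate reward means driven by a latent AR(k) state (illustrative form only)."""
    rng = np.random.default_rng(seed)
    # Hypothetical AR coefficients: positive and summing to 0.9, so the process is stable.
    gamma = np.full(k, 0.9 / k)
    mu = rng.normal(size=n_arms)    # assumed per-arm intercepts
    beta = rng.normal(size=n_arms)  # assumed per-arm loadings on the latent state
    z = np.zeros(T + k)             # first k entries are the zero initial history
    means = np.empty((T, n_arms))
    for t in range(T):
        past = z[t:t + k][::-1]     # z_{t-1}, ..., z_{t-k}
        z[t + k] = gamma @ past + sigma_z * rng.normal()
        means[t] = mu + beta * z[t + k]
    # Observed rewards add independent noise to the time-varying means.
    rewards = means + sigma_r * rng.normal(size=(T, n_arms))
    return means, rewards


means, rewards = simulate_latent_ar_bandit()
dynamic_oracle = means.max(axis=1).sum()  # best arm chosen at every time step
best_fixed = means.sum(axis=0).max()      # best single arm in hindsight
gap = dynamic_oracle - best_fixed
print(f"dynamic oracle minus best fixed arm: {gap:.3f}")
```

The gap is non-negative by construction, and it grows with `sigma_z` under these assumptions, which is one way to see why regret against the dynamic oracle depends on the noise variance of the latent state process.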
