Restless Linear Bandits
Azadeh Khaleghi
TL;DR
This work generalizes linear bandits to a restless setting where the payoff-generating parameter sequence $\theta_t$ is stationary and $\varphi$-mixing, yielding time-dependent rewards $Y_t=\langle \theta_t, X_t\rangle$. It quantifies the cost of replacing a dynamic restless oracle with a static mean oracle via the bound $\nu_n-\widetilde{\nu}_n \le 2 n \varphi_1 \|\theta_t\|_{\mathcal{L}_{\infty}}$ and introduces LinMix-UCB, an optimistic algorithm that handles long-range dependencies under an exponential mixing rate $\varphi_m \le a e^{-\gamma m}$. LinMix-UCB achieves sublinear regret with respect to an oracle that always plays a multiple of $\mathbb{E}[\theta_t]$, with finite-horizon guarantees of the form $\mathcal{O}\left(\sqrt{d n\,\mathrm{polylog}(n)}\right)$ and infinite-horizon guarantees via a doubling trick. The analysis relies on Berbee's coupling to generate near-independent samples and on confidence ellipsoids around $\theta^*$, bridging restless bandits and time-series concentration methods. The work opens avenues to relax the mixing assumption, learn mixing parameters online, and establish corresponding lower bounds.
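To make the setting concrete, the following is a minimal simulation sketch, not code from the paper. It assumes that $\theta_t$ is a deterministic function of a two-state ergodic Markov chain started from its stationary distribution (one standard example of a stationary, exponentially $\varphi$-mixing sequence) and that the action set is the Euclidean unit ball. The names `simulate_theta`, `states`, and `P`, and all numeric values, are hypothetical choices made for illustration. Under these assumptions, the static mean oracle plays $\mathbb{E}[\theta_t]/\|\mathbb{E}[\theta_t]\|_2$ in every round.

```python
import numpy as np

def simulate_theta(n, states, P, rng):
    """Sample theta_1..theta_n as a function of a stationary Markov chain.

    states: (k, d) array, the parameter vector attached to each chain state.
    P:      (k, k) row-stochastic transition matrix (assumed ergodic).
    """
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
    pi = pi / pi.sum()

    idx = np.empty(n, dtype=int)
    idx[0] = rng.choice(len(pi), p=pi)          # start in stationarity
    for t in range(1, n):
        idx[t] = rng.choice(len(pi), p=P[idx[t - 1]])
    return states[idx], pi

rng = np.random.default_rng(0)
states = np.array([[1.0, 0.2], [0.2, 0.6]])     # hypothetical theta values
P = np.array([[0.9, 0.1], [0.2, 0.8]])          # hypothetical sticky chain

thetas, pi = simulate_theta(5000, states, P, rng)

# Static mean oracle on the unit ball: always play E[theta_t] / ||E[theta_t]||.
mean_theta = pi @ states
x_static = mean_theta / np.linalg.norm(mean_theta)

# Rewards Y_t = <theta_t, X_t> when X_t = x_static in every round.
static_rewards = thetas @ x_static
print("average static-oracle reward:", static_rewards.mean())
print("||E[theta_t]||_2:", np.linalg.norm(mean_theta))
```

The two printed quantities should be close for long horizons, since the time average of a stationary mixing sequence concentrates around its mean. The sketch does not attempt to compute the restless oracle's value $\nu_n$.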
Abstract
A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with i.i.d. noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n\,\mathrm{polylog}(n)}\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
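The idea of selecting near-independent samples and building a confidence ellipsoid can be illustrated with a rough sketch. This is not the LinMix-UCB algorithm itself: it simply keeps every `gap`-th round, fits a ridge-regression estimate of $\mathbb{E}\theta_t$ on the retained rounds, and picks an optimistic action from a finite candidate set. The helper names (`spaced_ridge_estimate`, `ucb_action`), the spacing `gap`, the regularizer `lam`, the width `beta`, and the toy data model are all assumptions made for illustration; the paper's actual spacing and confidence radius are not reproduced here.

```python
import numpy as np

def spaced_ridge_estimate(X, y, gap, lam=1.0):
    """Ridge estimate of the mean parameter from every `gap`-th round only."""
    Xs, ys = X[::gap], y[::gap]
    V = lam * np.eye(X.shape[1]) + Xs.T @ Xs     # regularized design matrix
    theta_hat = np.linalg.solve(V, Xs.T @ ys)
    return theta_hat, V

def ucb_action(candidates, theta_hat, V, beta):
    """Pick the candidate maximizing <theta_hat, x> + beta * ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", candidates, V_inv, candidates))
    scores = candidates @ theta_hat + beta * widths
    return candidates[np.argmax(scores)]

rng = np.random.default_rng(1)
n, d = 4000, 2
mean_theta = np.array([0.7, 0.3])                # hypothetical E[theta_t]
delta = np.array([0.2, -0.1])                    # hypothetical fluctuation

# Toy dependent parameters: theta_t = mean + s_t * delta, where s_t in {+1, -1}
# is a sticky symmetric Markov chain that flips with probability 0.1.
flips = rng.random(n) < 0.1
s0 = rng.choice([-1.0, 1.0])
s = s0 * np.where(np.cumsum(flips) % 2 == 0, 1.0, -1.0)
thetas = mean_theta + s[:, None] * delta

# Exploratory actions on the unit circle and rewards Y_t = <theta_t, X_t>.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sum(X * thetas, axis=1)

theta_hat, V = spaced_ridge_estimate(X, y, gap=10)

# Finite candidate set: 64 evenly spaced unit vectors.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
action = ucb_action(candidates, theta_hat, V, beta=0.1)

print("estimate of E[theta_t]:", theta_hat)
print("optimistic action:", action)
```

Discarding the rounds between retained samples trades sample size for weaker dependence among the samples that remain, which is the intuition behind the coupling argument; a larger `gap` would be appropriate for a more slowly mixing sequence.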
