Preferences Evolve And So Should Your Bandits: Bandits with Evolving States for Online Platforms
Khashayar Khosravi, Renato Paes Leme, Chara Podimata, Apostolis Tsorvantzis
TL;DR
This work introduces Bandits with Deterministically Evolving States (B-DES), a bandit framework in which rewards depend on an unobserved, evolving state $q_t$ that updates as $q_{t+1}=(1-\lambda)q_t+\lambda b_{I_t}$, where $I_t$ is the arm pulled at round $t$. Measuring performance against a regret benchmark that accounts for long-term state effects (DES regret), the authors develop algorithms that achieve sublinear regret across the full spectrum of evolution rates $\lambda$. The components include a DP-based offline planner with approximations, estimators for arm parameters, and regime-specific strategies for slow, fast, and sticky dynamics. They establish several regret bounds (e.g., $\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$, $\widetilde{\mathcal{O}}(K\sqrt{T})$, and $\widetilde{\mathcal{O}}(\sqrt{KT\log K})$ in different $\lambda$-regimes) and demonstrate robustness to model misspecifications such as noise and unknown $\lambda$, which makes the approach applicable to online ads and content recommendation, where user states evolve with exposure. The work offers a principled, algorithmic treatment of evolving-state bandits with long-term impact, along with practical strategies for online platforms to balance immediate rewards against the long-term health of the system.
Abstract
We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states, which we call Bandits with Deterministically Evolving States (B-DES). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how "healthy" the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rates $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, which is significantly harder to attain than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$, and we show the robustness of our results to various model misspecifications.
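
To make the role of the evolution rate concrete, the following minimal sketch simulates only the state update $q_{t+1}=(1-\lambda)q_t+\lambda b_{I_t}$ quoted in the TL;DR. It does not model rewards or any of the paper's algorithms. The function name `evolve_state`, the initial state `q0`, the per-arm values in `b`, and the pull sequence are all hypothetical choices made for illustration.

```python
def evolve_state(q0, b, pulls, lam):
    """Return the state trajectory under q_{t+1} = (1 - lam) * q_t + lam * b[I_t].

    q0    : assumed initial state (illustrative; not specified in the text above)
    b     : list of per-arm values b_i (hypothetical numbers)
    pulls : sequence of arm indices I_1, I_2, ...
    lam   : evolution rate, must lie in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    states = [q0]
    for arm in pulls:
        # Convex combination of the current state and the pulled arm's b value.
        states.append((1.0 - lam) * states[-1] + lam * b[arm])
    return states


b = [0.2, 0.9]        # hypothetical b_i for two arms
pulls = [0, 1, 1, 0]  # hypothetical arm sequence

at_zero = evolve_state(0.5, b, pulls, lam=0.0)  # state never moves
at_one = evolve_state(0.5, b, pulls, lam=1.0)   # state equals b of the last pull
slow = evolve_state(0.5, b, pulls, lam=0.1)     # state drifts gradually
```

At $\lambda = 0$ the state stays at its initial value regardless of which arms are pulled, which is consistent with the abstract's remark that standard multi-armed bandits arise as a special case. At $\lambda = 1$ the state depends only on the most recent pull. For intermediate values, the state is an exponentially weighted average of past $b_{I_t}$ values, so the order of pulls matters, which is why the benchmark is a sequence of arms rather than a single arm.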
