Preferences Evolve And So Should Your Bandits: Bandits with Evolving States for Online Platforms

Khashayar Khosravi; Renato Paes Leme; Chara Podimata; Apostolis Tsorvantzis

TL;DR

This work introduces Bandits with Deterministically Evolving States (B-DES), a bandit framework where rewards depend on an unobserved, evolving state $q_t$ that updates as $q_{t+1}=(1-\lambda)q_t+\lambda b_{I_t}$. By treating long-term state effects via the DES regret benchmark, the authors develop algorithms that achieve sublinear regret across the full spectrum of evolution rates $\lambda$, including a DP-based offline planner with approximations, estimators for arm parameters, and regime-specific strategies for slow, fast, and sticky dynamics. They establish several regret bounds (e.g., $\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$, $\widetilde{\mathcal{O}}(K\sqrt{T})$, and $\widetilde{\mathcal{O}}(\sqrt{KT\log K})$ in different $\lambda$-regimes) and demonstrate robustness to model misspecifications such as noise and unknown $\lambda$, making the approach applicable to online ads and content recommendation where user states evolve with exposure. The work advances a principled, algorithmic treatment of evolving-state bandits with long-term impact, offering practical strategies for online platforms to balance immediate rewards and long-term health of the system.
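The state update quoted above can be sketched in a few lines of Python. This is only an illustration of the recursion $q_{t+1}=(1-\lambda)q_t+\lambda b_{I_t}$, not the authors' algorithm: the function names, the per-arm values in `b`, and the initial state `q0` are assumptions made for the example, and the reward function is left out because this summary does not state its form.

```python
def evolve_state(q, b_pulled, lam):
    """One step of the update q_{t+1} = (1 - lam) * q_t + lam * b_{I_t}."""
    return (1.0 - lam) * q + lam * b_pulled


def simulate_states(b, arms, lam, q0):
    """Return [q_0, q_1, ..., q_T] for a given sequence of pulled arms."""
    states = [q0]
    for arm in arms:
        states.append(evolve_state(states[-1], b[arm], lam))
    return states


# Hypothetical instance: two arms with assumed parameters b_0 and b_1.
b = [0.15, 0.8]

# Intermediate rate: pulling arm 0 repeatedly moves the state toward b_0.
traj = simulate_states(b, arms=[0] * 5, lam=0.5, q0=1.0)

# lam = 0: the state never moves, whatever is pulled.
frozen = simulate_states(b, arms=[0, 1, 0], lam=0.0, q0=0.3)

# lam = 1: the state is fully determined by the last arm pulled.
sticky = simulate_states(b, arms=[0, 1, 0], lam=1.0, q0=0.3)
```

With $\lambda = 0$ the state stays at its initial value, and with $\lambda = 1$ it always equals the parameter of the most recently pulled arm; intermediate rates blend the two.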

Abstract

We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states, which we call Bandits with Deterministically Evolving States (B-DES). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how "healthy" the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, which is significantly harder to attain than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$, and we show the robustness of our results to various model misspecifications.

Paper Structure

This paper contains 28 sections, 32 theorems, 134 equations, 2 figures, 1 table, 8 algorithms.

Introduction
Our Contributions
Related Work
Model & Preliminaries
Experimental evidence for the functional form in $\textsc{B-DES}$
External vs DES Regret
General Evolution Rate Algorithm
Relaxation: Dynamic Programming with Approximate Rewards
Estimating the IV Rewards and ES
Slow State Evolution: $\lambda \in [0, \widetilde{\Theta}(1/T)]$
Fast State Evolution: $\lambda \in [\widetilde{\Theta}(1 - 1/\sqrt{T}), 1]$
"Sticky" Arms: Evolution Rate $\lambda = 1$
Evolution Rate $\lambda \in [\widetilde{\Theta}(1 - 1/\sqrt{T}), 1)$
Robustness
Noise Perturbed Model
...and 13 more sections

Key Result

Proposition 2.1

Let $\textsf{ALG}$ be a no-external-regret algorithm (e.g., UCB, AAE, or EXP3). For any such algorithm, there exists a family of instances $\mathcal{I}$ on which $\textsf{ALG}$ incurs DES regret $R_{\textsc{DES}}(T) = \Omega(T)$.

Figures (2)

Figure 1: State evolution function for a fixed arm with $b_i = 0.15$ and $\lambda = 0.5$.
Figure 2: External vs DES regret for an instance with $2$ arms and varying $\lambda$.
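
For the fixed-arm setting of Figure 1, the update $q_{t+1}=(1-\lambda)q_t+\lambda b_i$ can be unrolled into a closed form. The derivation below is a worked illustration rather than a statement taken from the paper; the initial state $q_0$ is an assumed symbol, since this summary does not give its value.

$$
q_{t+1} - b_i = (1-\lambda)\,(q_t - b_i)
\quad\Longrightarrow\quad
q_t = b_i + (1-\lambda)^t\,(q_0 - b_i).
$$

With $b_i = 0.15$ and $\lambda = 0.5$, this gives $q_t = 0.15 + 2^{-t}(q_0 - 0.15)$, so the gap between the state and $b_i$ halves every round the arm is pulled.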

Theorems & Definitions (62)

Proposition 2.1
Theorem 3.1
Lemma 3.2
Lemma 3.3
Lemma 3.4
Lemma 3.5
Corollary 3.6
Lemma 3.7
Lemma 3.8
Lemma 3.9
...and 52 more
