Online Recommendations for Agents with Discounted Adaptive Preferences

Arpit Agarwal; William Brown

TL;DR

This work analyzes online menu-based recommendations in which an agent's preferences evolve according to an unknown model with discounted memory. It develops a unifying framework that distinguishes long-memory and short-memory regimes, with target benchmarks such as the set of everywhere instantaneously realizable distributions (EIRD) and, for scale-bounded models, the φ-smoothed simplex. Central contributions include the Deferred Bandit Gradient (DBG) algorithm and the MenuDist procedure for efficiently realizing target distributions. With these, the paper obtains sublinear regret against EIRD under smooth preferences and extends sublinear guarantees to nearly the full simplex under scale-boundedness. It also establishes NP-hardness of expanding the target set beyond EIRD in general, shows sublinear regret in the short-memory regime for scale-bounded models, and gives an information-theoretic barrier to competing against EIRD under arbitrary smooth preference models. Together, these results show how memory horizon, target set choice, and structural assumptions on preference functions shape the attainable regret in adaptive online recommendations with multiple-item menus.

Abstract

We consider a bandit recommendations problem in which an agent's preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown $\textit{preference model}$. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some $\textit{target set}$ (a subset of the item simplex) for adversarial losses over the agent's choices. Extending the setting from Agarwal and Brown (2022), where uniform-memory agents were considered, here we allow for non-uniform memory in which a discount factor is applied to the agent's memory vector at each subsequent round. In the "long-term memory" regime (when the effective memory horizon scales with $T$ sublinearly), we show that efficient sublinear regret is obtainable with respect to the set of $\textit{everywhere instantaneously realizable distributions}$ (the "EIRD set", as formulated in prior work) for any $\textit{smooth}$ preference model. Further, for preferences which are bounded above and below by linear functions of memory weight (we call these "scale-bounded" preferences) we give an algorithm which obtains efficient sublinear regret with respect to nearly the $\textit{entire}$ item simplex. We show an NP-hardness result for expanding to targets beyond EIRD in general. In the "short-term memory" regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret for nearly the entire simplex even without smoothness if losses do not change too frequently, yet we show an information-theoretic barrier for competing against the EIRD set under arbitrary smooth preference models even when losses are constant.
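The interaction loop described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's algorithm: the memory vector is assumed to be renormalized to sum to 1 after each discounted update, the agent is assumed to choose from the menu with probability proportional to its preference scores, and `example_preference` is a hypothetical preference model invented for this example. Menus are drawn uniformly at random here, whereas the paper's algorithms choose them adaptively. With discount factor `gamma`, a standard heuristic puts the effective memory horizon at roughly 1 / (1 - gamma) rounds; the paper's formal definition may differ.

```python
import random


def discounted_update(memory, chosen, gamma):
    """Discount the old memory by gamma, add weight 1 to the chosen item,
    and renormalize (normalization is an assumption of this sketch)."""
    updated = [gamma * m for m in memory]
    updated[chosen] += 1.0
    total = sum(updated)
    return [m / total for m in updated]


def example_preference(memory):
    """Hypothetical preference model: an item's score shrinks as it
    takes up more of the agent's memory (a simple satiation effect)."""
    return [1.0 - 0.5 * v for v in memory]


def choice_probs(scores, menu):
    """Selection probabilities over the menu, proportional to scores
    (assumed choice rule for this sketch)."""
    total = sum(scores[i] for i in menu)
    return {i: scores[i] / total for i in menu}


def simulate(n=5, k=2, gamma=0.9, rounds=100, seed=0):
    """Show random k-item menus to the agent and track its memory."""
    rng = random.Random(seed)
    memory = [1.0 / n] * n   # start from a uniform memory vector
    counts = [0] * n         # how often each item was selected
    for _ in range(rounds):
        menu = rng.sample(range(n), k)
        probs = choice_probs(example_preference(memory), menu)
        chosen = rng.choices(list(probs), weights=list(probs.values()))[0]
        counts[chosen] += 1
        memory = discounted_update(memory, chosen, gamma)
    return memory, counts


if __name__ == "__main__":
    final_memory, selection_counts = simulate()
    print("final memory:", [round(v, 3) for v in final_memory])
    print("selection counts:", selection_counts)
```

Lowering `gamma` makes the memory vector track only the most recent selections, which corresponds to the short-term memory regime, while values of `gamma` close to 1 approach the uniform-memory setting of the earlier work.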

Paper Structure (47 sections, 16 theorems, 84 equations, 7 algorithms)

Introduction
Setting
Our Results
Related Work
Stochastic bandits with changing rewards
Models of preference dynamics
Reinforcement learning and recommendation systems
Dueling bandits
Preliminaries
Interaction Model
Realizable Distributions
Discounted Memory Agents
Smooth Preference Models
Targeting $\mathrm{EIRD}$ for Agents with Long Memory Horizons
Characterizing $\mathrm{IRD}$ via Menu Times
...and 32 more sections

Key Result

Lemma 1

An item distribution $x$ belongs to $\mathrm{IRD}(v, M)$ if and only if the menu time $\mu_i$ of every item is at most $1$. If this condition holds, there is a $\mathrm{poly}(n)$-time algorithm MenuDist$(v, x, M)$ for constructing a

Theorems & Definitions (33)

Definition 1: Discounted Memory Updating
Definition 2: Smooth Preference Models
Lemma 1
Theorem 1
Theorem 2
Definition 3: Scale-Bounded Functions
Definition 4: $\phi$-Smoothed Simplex
Lemma 2
Theorem 3: Scale-Bounded Discounted Regret Bound
Theorem 4
...and 23 more
