Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems

Aditya Narayan Ravi, Pranav Poduval, Sharayu Moharir

TL;DR

The paper introduces Unreliable Multi-Armed Bandits to model recommendation systems in which an autonomous user acts as an unreliable intermediary whose choices follow a perturbable Markov chain with transition matrix $P$ and perturbation $\delta$. It proves two key results: (1) when a genie reveals the arm means, the optimal policy is $\pi^{\delta^{*}}$, which biases transitions toward the best arm, and (2) regret is bounded below linearly, $\mathbb{E}[C_T] \ge T(\mu^* - \tilde{\mu})$, where $\mu^*$ is the best arm mean and $\tilde{\mu}$ is the stationary-weighted mean after applying the optimal perturbation. Guided by these insights, the authors propose the Perturbation Explore ($\text{P}^{2}\text{EE}$) algorithm, which explores by steering toward the least-explored state and then exploits using $\pi^{\delta^{*}}$, achieving near-optimal performance in simulations. The work provides a principled framework for studying the trade-off between data collection costs and reward gains when user autonomy biases recommendations, and offers a concrete perturbation-based strategy with theoretical backing and empirical validation.
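The lower bound $T(\mu^* - \tilde{\mu})$ can be made concrete with a small numerical sketch. The summary does not state the exact form of the perturbation, so the code below assumes one purely for illustration: every row of $P$ is mixed with a point mass on the best arm, $(1-\delta)P + \delta e_{\text{best}}$. The function names (`stationary_distribution`, `perturb_toward`, `regret_lower_bound`) are hypothetical and not taken from the paper.

```python
import numpy as np


def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 as a least-squares system."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the normalisation row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def perturb_toward(P, target, delta):
    """ASSUMED perturbation model: shift a delta fraction of each row onto `target`."""
    Q = (1.0 - delta) * P  # new array, P itself is not modified
    Q[:, target] += delta
    return Q


def regret_lower_bound(P, mu, delta, T):
    """Evaluate T * (mu_star - mu_tilde) under the assumed perturbation model."""
    best = int(np.argmax(mu))
    pi = stationary_distribution(perturb_toward(P, best, delta))
    mu_tilde = float(pi @ mu)  # stationary-weighted mean reward
    return T * (float(np.max(mu)) - mu_tilde)


# Two-state example: the user picks either state with probability 1/2.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
mu = np.array([0.2, 0.8])
print(regret_lower_bound(P, mu, delta=0.2, T=100))
```

In this example the perturbed chain has stationary distribution $(0.4, 0.6)$, so $\tilde{\mu} = 0.56$ and the bound evaluates to approximately 24 over 100 steps. With $\delta = 1$ the assumed model lets the bandit force the best arm outright and the bound drops to zero.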

Abstract

We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling arms with unknown rewards. The catch, however, is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy in choosing its arms. For example, on a streaming website the user has a lot of autonomy in choosing the content they want to watch. Streaming sites can use targeted advertising as a means to bias the opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain, which the bandit is allowed to perturb. We prove fundamental theorems for this setting and then present a close-to-optimal Explore-Commit algorithm.
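A minimal simulation sketch of an explore-then-commit strategy in this setting follows. It is not the authors' algorithm: the perturbation form (mixing the current row of the chain with a point mass on a chosen target state), the Bernoulli rewards, the fixed exploration length, and the names `perturbed_row` and `explore_then_commit` are all assumptions made for illustration.

```python
import numpy as np


def perturbed_row(P, state, target, delta):
    """ASSUMED perturbation: mix the user's row with a point mass on `target`."""
    row = (1.0 - delta) * P[state]  # new array, P is left untouched
    row[target] += delta
    return row


def explore_then_commit(P, mu, delta, T, explore_steps, seed=0):
    """Steer toward the least-visited state, then commit to the empirical best."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    counts = np.zeros(n, dtype=int)  # visits per state (arm)
    sums = np.zeros(n)               # accumulated reward per state
    state, total, commit_target = 0, 0.0, None
    for t in range(T):
        # The user sits in `state`; the bandit observes a Bernoulli reward (assumed).
        reward = float(rng.random() < mu[state])
        counts[state] += 1
        sums[state] += reward
        total += reward
        if t < explore_steps:
            target = int(np.argmin(counts))  # explore: least-visited state
        else:
            if commit_target is None:
                # Commit once, using the empirical means gathered so far.
                commit_target = int(np.argmax(sums / np.maximum(counts, 1)))
            target = commit_target
        # The user still moves autonomously; the bandit only biases the move.
        state = int(rng.choice(n, p=perturbed_row(P, state, target, delta)))
    return counts, total


P = np.array([[0.5, 0.5], [0.5, 0.5]])
mu = np.array([0.2, 0.8])
counts, total = explore_then_commit(P, mu, delta=0.2, T=200, explore_steps=50)
print(counts, total)
```

Because the user keeps some autonomy for any $\delta < 1$, the chain continues to visit sub-optimal states after the commit, which matches the intuition behind regret growing linearly in $T$ in this model.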

Paper Structure

This paper contains 6 sections, 8 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Two-state Markov chain (a) without and (b) with the effect of the recommendation system
  • Figure 2: $C_{T}$ vs $T$ comparison for $\text{P}^{2}\text{EE}$. Our method, $\text{P}^{2}\text{EE}$, has cumulative regret very close to that of the optimal "genie", and its regret decreases as $\delta$ increases. UCB and greedy are clearly sub-optimal.
  • Figure 3: Variation of $C_{T}$ as $\delta$ increases for $\text{P}^{2}\text{EE}$