Biased Dueling Bandits with Stochastic Delayed Feedback

Bongsoo Yi, Yue Kang, Yao Li

TL;DR

This work addresses biased dueling bandits under stochastic delays, a practical setting where feedback is not only delayed but also carries a preference bias between the two selected arms. It introduces two algorithms: RUCB-Delay, which requires full knowledge of the delay distribution and achieves regret matching the no-delay optimum, and MRR-DB-Delay, which needs only the expected delay and uses a round-robin elimination scheme. The authors establish a high-probability regret bound for RUCB-Delay and a bound on the expected regret $\mathbb{E}[\mathcal{R}(T)]$ for MRR-DB-Delay. They validate both algorithms on synthetic and real datasets, where both improve on the baselines. The work clarifies how to handle bias and delay in relative-feedback settings, with implications for online advertising, recommendation, and information retrieval systems.

Abstract

The dueling bandit problem, an important variant of the traditional multi-armed bandit problem, has gained prominence recently due to its broad applications in online advertising, recommendation systems, information retrieval, and more. However, in many real-world applications, feedback on actions is subject to unavoidable delays and is not immediately available to the agent. This partial observability poses a significant challenge to the existing dueling bandit literature, as it affects how quickly and accurately the agent can update its policy on the fly. In this paper, we introduce and examine the biased dueling bandit problem with stochastic delayed feedback, a more realistic setting that involves a preference bias between the selections. We present two algorithms designed to handle delay. Our first algorithm, which requires complete information about the delay distribution, achieves the regret bound that is optimal for the dueling bandit problem without delay. The second algorithm is tailored to situations where the delay distribution is unknown and only the expected delay is available. We provide a comprehensive regret analysis for the two proposed algorithms and then evaluate their empirical performance on both synthetic and real datasets.

Paper Structure

This paper contains 29 sections, 22 theorems, 98 equations, 2 figures, and 2 algorithms.

Key Result

Proposition 1

$\hat{\mu}_{ij}(t)$ is a conditionally unbiased estimator of $\mu_{ij}$, when conditioning on the selections of arms.
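
The definition of $\hat{\mu}_{ij}(t)$ is not reproduced on this page, so the following is only a minimal sketch of one standard way a known delay distribution can remove delay-induced bias. It is not necessarily the paper's estimator. It assumes a simplified model: in a duel between arms $i$ and $j$, a win by $j$ becomes visible only after a random delay $D$ with known CDF $F$, and until then the duel looks like a win for $i$. Dividing each revealed win by $F$ evaluated at the elapsed time makes the average unbiased for $\mu_{ji}$ given the selections, and $1$ minus that average then estimates $\mu_{ij}$. The geometric delay and all function names are hypothetical choices made for illustration.

```python
import random


def delay_cdf(k, p=0.2):
    """P(D <= k) for a geometric delay on {1, 2, ...} (an assumed distribution)."""
    return 0.0 if k < 1 else 1.0 - (1.0 - p) ** k


def sample_delay(rng, p=0.2):
    """Draw a delay whose CDF is delay_cdf: count trials until the first success."""
    d = 1
    while rng.random() >= p:
        d += 1
    return d


def run_once(mu_ji, horizon, rng):
    """Play one duel per round s = 0..horizon-1 and estimate mu_ji at time `horizon`."""
    naive, weighted = 0.0, 0.0
    for s in range(horizon):
        j_wins = rng.random() < mu_ji
        # Assumed model: a win by j is visible only once its delay has elapsed.
        revealed = j_wins and (s + sample_delay(rng) <= horizon)
        if revealed:
            naive += 1.0
            # Reweight by the probability that the win is visible by now.
            weighted += 1.0 / delay_cdf(horizon - s)
    return naive / horizon, weighted / horizon


def average_estimates(mu_ji=0.4, horizon=20, runs=4000, seed=0):
    """Average the naive and reweighted estimates over many independent runs."""
    rng = random.Random(seed)
    totals = [0.0, 0.0]
    for _ in range(runs):
        n, w = run_once(mu_ji, horizon, rng)
        totals[0] += n
        totals[1] += w
    return totals[0] / runs, totals[1] / runs
```

Calling `average_estimates()` returns the two averaged estimates of $\mu_{ji}$. Under these assumptions the naive average should fall clearly below the true value of 0.4, because recent wins by $j$ are still hidden, while the reweighted average should land close to it. The reweighting needs the full CDF $F$, which matches the summary's point that the first algorithm requires complete knowledge of the delay distribution.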

Figures (2)

  • Figure 1: Case (a) illustrates the e-commerce example of the biased dueling bandit with delayed feedback: a delay occurs before the user can evaluate and potentially select Product B, and any data collected during this delay suggests $A > B$. Case (b) showcases the traditional dueling bandit with immediate feedback.
  • Figure 2: Average Cumulative Regret. The regret performances, averaged over 100 independent runs, along with their standard deviations, are reported for each dataset and algorithm.

Theorems & Definitions (42)

  • Proposition 1
  • Proof
  • Lemma 1
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 1
  • ...and 32 more