Utility-based Dueling Bandits as a Partial Monitoring Game

Pratik Gajane, Tanguy Urvoy

TL;DR

The paper casts utility-based dueling bandits within the partial monitoring (PM) framework and proves the binary case is locally observable, placing it in the easy PM regime with $\tilde{\Theta}(\sqrt{T})$ regret. It surveys PM algorithms and discusses how they could solve dueling bandits efficiently, noting that the rex3 algorithm attains $\tilde{O}(\sqrt{KT})$ regret while many PM methods incur $\tilde{O}(K\sqrt{T})$ due to the quadratic action count $N\approx K^2$. The PM perspective clarifies how relative feedback and signal observability shape regret, suggesting that structure-exploiting PM methods are well-suited for dueling bandits. Overall, the work provides a foundational PM-based pathway for improving online learning with partial feedback in dueling settings, and identifies rex3 as a particularly promising approach for the natural action scale.

Abstract

Partial monitoring is a generic framework for sequential decision-making with incomplete feedback. It encompasses a wide class of problems such as dueling bandits, learning with expert advice, dynamic pricing, dark pools, and label-efficient prediction. We study the utility-based dueling bandit problem as an instance of a partial monitoring problem and prove that it fits the time-regret partial monitoring hierarchy as an easy, i.e. $\tilde{\Theta}(\sqrt{T})$, instance. We survey some partial monitoring algorithms and see how they could be used to solve dueling bandits efficiently.

Keywords: Online learning, Dueling Bandits, Partial Monitoring, Partial Feedback, Multiarmed Bandits

Paper Structure

This paper contains 10 sections, 3 theorems, 10 equations, 2 figures, 2 tables.

Key Result

Theorem 1

Let $\left(\boldsymbol{N},\boldsymbol{M},\boldsymbol{\Sigma}, \mathcal{L},\mathcal{H} \right)$ be a partial monitoring game. Let $\{ C_1, \dots, C_k\}$ be its cell decomposition, with corresponding loss vectors $\boldsymbol{l}_1, \dots, \boldsymbol{l}_k$. The game falls into the following four regr

Figures (2)

  • Figure 1: Gain matrix $\mathcal{G}$ and feedback matrix $\mathcal{H}$ for a $4$-armed binary dueling bandits resulting in $10$ non-duplicate actions and $16$ possible outcomes.
  • Figure 2: Signal matrix for action $(12)$ for the same problem as in Figure 1.
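
The counts quoted in the figure captions (10 non-duplicate actions, 16 outcomes) can be reproduced with a small sketch. It rests on assumptions that this summary does not state: an outcome is a binary utility vector $x \in \{0,1\}^K$, an action is an unordered pair of arms $(i,j)$ with $i \le j$, the gain is $(x_i + x_j)/2$, and the feedback is the difference $x_i - x_j$. The function names `build_game` and `signal_matrix` are hypothetical, and arms are zero-indexed, so the action written $(12)$ in Figure 2 corresponds to `(0, 1)` here.

```python
from itertools import combinations_with_replacement, product


def build_game(num_arms):
    """Build gain and feedback matrices for binary utility-based dueling bandits.

    Assumed model (not taken from the summary): gain = (x_i + x_j) / 2 and
    feedback = x_i - x_j for a binary utility vector x.
    """
    # Non-duplicate actions: unordered pairs (i, j) with i <= j, self-duels included.
    actions = list(combinations_with_replacement(range(num_arms), 2))
    # Outcomes: every binary utility vector in {0, 1}^K.
    outcomes = list(product((0, 1), repeat=num_arms))
    gain = [[(x[i] + x[j]) / 2 for x in outcomes] for (i, j) in actions]
    feedback = [[x[i] - x[j] for x in outcomes] for (i, j) in actions]
    return actions, outcomes, gain, feedback


def signal_matrix(feedback_row):
    """Return (symbols, matrix): one 0/1 indicator row per distinct feedback symbol."""
    symbols = sorted(set(feedback_row))
    matrix = [[1 if f == s else 0 for f in feedback_row] for s in symbols]
    return symbols, matrix


actions, outcomes, gain, feedback = build_game(4)
# Signal matrix of the duel between the first two arms.
symbols, signal_12 = signal_matrix(feedback[actions.index((0, 1))])

print(len(actions), len(outcomes))              # number of actions and outcomes
print(symbols, len(signal_12), len(signal_12[0]))  # feedback symbols and matrix shape
```

Under these assumptions each column of the signal matrix contains exactly one 1, because every outcome produces exactly one feedback symbol for a given action.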

Theorems & Definitions (7)

  • Definition 1: Cells
  • Definition 2: Cell decomposition
  • Definition 3: Signal matrices
  • Definition 4: Observability
  • Theorem 1: Classification of partial monitoring problems
  • Theorem 2: Dueling bandits: locally observable
  • Corollary 3