Utility-based Dueling Bandits as a Partial Monitoring Game

Pratik Gajane; Tanguy Urvoy

TL;DR

The paper casts utility-based dueling bandits within the partial monitoring (PM) framework and proves the binary case is locally observable, placing it in the easy PM regime with $\tilde{\Theta}(\sqrt{T})$ regret. It surveys PM algorithms and discusses how they could solve dueling bandits efficiently, noting that the rex3 algorithm attains $\tilde{O}(\sqrt{KT})$ regret while many PM methods incur $\tilde{O}(K\sqrt{T})$ due to the quadratic action count $N\approx K^2$. The PM perspective clarifies how relative feedback and signal observability shape regret, suggesting that structure-exploiting PM methods are well-suited for dueling bandits. Overall, the work provides a foundational PM-based pathway for improving online learning with partial feedback in dueling settings, and identifies rex3 as a particularly promising approach for the natural action scale.

Abstract

Partial monitoring is a generic framework for sequential decision-making with incomplete feedback. It encompasses a wide class of problems such as dueling bandits, learning with expert advice, dynamic pricing, dark pools, and label-efficient prediction. We study the utility-based dueling bandit problem as an instance of a partial monitoring problem and prove that it fits the time-regret partial monitoring hierarchy as an easy instance, i.e. one with $\tilde{\Theta}(\sqrt{T})$ regret. We survey some partial monitoring algorithms and see how they could be used to solve dueling bandits efficiently.

Keywords: Online learning, Dueling Bandits, Partial Monitoring, Partial Feedback, Multi-armed Bandits
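To make the quadratic action count concrete, the following is a minimal sketch of how a binary utility-based dueling bandit could be written down as a finite partial monitoring game. It is an illustration under stated assumptions, not the authors' construction: the function name `dueling_pm_game`, the loss $1 - (x_i + x_j)/2$, and the feedback $x_i - x_j$ are hypothetical choices made here to show that $K$ arms yield $N = K^2$ actions (duels) and $2^K$ binary outcomes.

```python
from itertools import product


def dueling_pm_game(K):
    """Enumerate a binary utility-based dueling bandit as a finite game.

    Assumptions (illustrative, not taken from the paper):
      - an outcome is a binary utility vector x in {0, 1}^K,
      - an action is an ordered pair (i, j) of arms, i.e. a duel,
      - the loss of a duel is 1 - (x_i + x_j) / 2,
      - the feedback of a duel is the relative signal x_i - x_j.
    """
    actions = list(product(range(K), repeat=2))   # N = K^2 duels
    outcomes = list(product((0, 1), repeat=K))    # 2^K utility vectors
    # One row per action, one column per outcome.
    loss = [[1 - (x[i] + x[j]) / 2 for x in outcomes] for (i, j) in actions]
    feedback = [[x[i] - x[j] for x in outcomes] for (i, j) in actions]
    return actions, outcomes, loss, feedback


actions, outcomes, loss, feedback = dueling_pm_game(3)
print(len(actions), len(outcomes))
```

Under these assumptions, a duel of an arm against itself, such as (0, 0), always returns feedback 0, so it reveals nothing about the utilities even though its loss depends on them. This is the kind of gap between loss and signal that the partial monitoring view makes explicit.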
