Utility-based Dueling Bandits as a Partial Monitoring Game
Pratik Gajane, Tanguy Urvoy
TL;DR
The paper casts utility-based dueling bandits within the partial monitoring (PM) framework and proves that the binary case is locally observable, placing it in the easy PM regime with $\tilde{\Theta}(\sqrt{T})$ regret. It surveys PM algorithms and discusses how they could solve dueling bandits efficiently, noting that the REX3 algorithm attains $\tilde{O}(\sqrt{KT})$ regret for $K$ arms, while many PM methods incur $\tilde{O}(K\sqrt{T})$ because the number of actions (pairs of arms) is quadratic, $N\approx K^2$. The PM perspective clarifies how relative feedback and signal observability shape regret, suggesting that PM methods which exploit this structure are well suited to dueling bandits. Overall, the work provides a PM-based foundation for online learning with partial feedback in dueling settings, and identifies REX3 as a particularly promising approach because its regret scales with the number of arms $K$ rather than the number of pairs $N$.
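To make the quadratic action count concrete, the following minimal Python sketch enumerates a binary utility-based dueling bandit as a finite PM game. It is an illustration under stated assumptions, not the authors' construction: the function name `build_dueling_pm_game` is hypothetical, the loss of a duel is assumed to be one minus the mean utility of the two chosen arms, and the feedback is assumed to be the utility difference (win, tie, or loss).

```python
from itertools import product


def build_dueling_pm_game(K):
    """Enumerate a binary utility-based dueling bandit as a finite PM game.

    Assumptions (not taken from the source text):
      - outcomes are binary utility vectors x in {0, 1}^K;
      - an action is an ordered pair of arms (i, j);
      - the (unobserved) loss is 1 - (x[i] + x[j]) / 2;
      - the observed feedback is the difference x[i] - x[j] in {-1, 0, 1}.
    """
    actions = [(i, j) for i in range(K) for j in range(K)]  # N = K^2 pairs
    outcomes = list(product((0, 1), repeat=K))              # 2^K utility vectors
    # One row per action, one column per outcome.
    loss = [[1 - (x[i] + x[j]) / 2 for x in outcomes] for (i, j) in actions]
    feedback = [[x[i] - x[j] for x in outcomes] for (i, j) in actions]
    return actions, outcomes, loss, feedback


actions, outcomes, loss, feedback = build_dueling_pm_game(3)
print(len(actions), len(outcomes))  # number of actions and of outcomes
```

With $K=3$ arms this yields $N=9$ actions, which is why a generic PM algorithm that treats every pair as an unrelated action pays a regret factor in $N\approx K^2$ rather than in $K$.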
Abstract
Partial monitoring is a generic framework for sequential decision-making with incomplete feedback. It encompasses a wide class of problems such as dueling bandits, learning with expert advice, dynamic pricing, dark pools, and label-efficient prediction. We study the utility-based dueling bandit problem as an instance of a partial monitoring problem and prove that it fits the time-regret partial monitoring hierarchy as an easy instance, i.e. one with $\tilde{\Theta}(\sqrt{T})$ regret. We survey some partial monitoring algorithms and see how they could be used to solve dueling bandits efficiently.
Keywords: Online learning, Dueling Bandits, Partial Monitoring, Partial Feedback, Multiarmed Bandits
