Adversarial Multi-dueling Bandits

Pratik Gajane

TL;DR

The paper introduces MiDEX (Multi Dueling EXP3), an algorithm that learns from preference feedback assumed to be generated by a pairwise-subset choice model. It proves that the expected cumulative $T$-round regret of MiDEX, measured against a Borda-winner from a set of $K$ arms, is upper bounded by $O((K \log K)^{1/3} T^{2/3})$.

Abstract

We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, which is determined by an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback, which is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal: the two bounds differ only by a factor of $(\log K)^{1/3}$.
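To make the setting concrete, the following is a minimal Python sketch of the two objects the abstract refers to: the Borda score of an arm under a preference matrix, and winner feedback for a selected set of $m$ arms. It is not the MiDEX algorithm. The function names (`borda_scores`, `winner_probabilities`, `sample_winner`) and the example matrix are hypothetical, and the specific form of the choice model used here (a pair is drawn uniformly from the selected arms and the winner of that duel is reported) is an assumption taken from common usage in the multi-dueling literature, not something stated on this page.

```python
import numpy as np


def borda_scores(P):
    """Borda score of each arm: its average probability of beating another arm.

    P[i, j] is the probability that arm i is preferred over arm j,
    with P[i, j] + P[j, i] = 1 and P[i, i] = 0.5.
    """
    K = P.shape[0]
    # Exclude the diagonal entry P[i, i] from the average over opponents.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)


def winner_probabilities(P, subset):
    """Winner distribution over the positions of `subset`.

    ASSUMPTION: pairwise-subset choice model in which position a wins with
    probability sum_{b != a} 2 * P[subset[a], subset[b]] / (m * (m - 1)).
    """
    m = len(subset)
    probs = np.zeros(m)
    for a in range(m):
        for b in range(m):
            if a != b:
                probs[a] += 2.0 * P[subset[a], subset[b]] / (m * (m - 1))
    return probs


def sample_winner(P, subset, rng):
    """Draw the identity of the most preferred arm for one round."""
    probs = winner_probabilities(P, subset)
    return int(rng.choice(subset, p=probs))


# Hypothetical 3-arm preference matrix, for illustration only.
P = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
scores = borda_scores(P)
probs = winner_probabilities(P, [0, 1, 2])
winner = sample_winner(P, [0, 1, 2], np.random.default_rng(0))
```

In this example arm 0 has the highest Borda score and is therefore the Borda-winner against which regret would be measured. In the adversarial setting the matrix `P` may change from round to round; the sketch fixes a single round's matrix for simplicity.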

Paper Structure (19 sections, 21 theorems, 30 equations, 3 algorithms)

Introduction
Related Work
Problem Setting
Pairwise-subset Choice Model
Performance Measure: Regret
Our Algorithm and Performance Guarantee
Mathematical Analysis
Proof of Theorem \ref{thm:main}
Varying $m_t$
Lower Bound
Concluding Remarks
Proof of Lemma \ref{lem:ExpG}
Proof of Lemma \ref{lem:ExpScore}
Proof of Lemma \ref{lem:GUpperBound}
Proof of Lemma \ref{lem:ScoreUpperBound}
...and 4 more sections

Key Result

Proposition 1

The shifted Borda score $s_t(i)$ of any arm $i \in [K]$ is related to its Borda score $b_t(i)$ by the equation

Theorems & Definitions (35)

Definition 1: Borda Score
Definition 2: Regret
Definition 3: Shifted Borda Score
Definition 4: Shifted Borda Regret
Proposition 1
proof
Proposition 2
Proposition 3
Theorem 1
Lemma 1
...and 25 more
