Adversarial Multi-dueling Bandits

Pratik Gajane

TL;DR

A novel algorithm, MiDEX (Multi Dueling EXP3), is introduced to learn from preference feedback that is assumed to be generated from a pairwise-subset choice model, and it is proved that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$.

Abstract

We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback, which is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal.
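
To make the setting concrete, the following is a minimal Python sketch of the quantities the abstract refers to. It is not the paper's MiDEX algorithm. It rests on two assumptions that this page does not spell out: the Borda score of arm $i$ is taken to be the usual average $\frac{1}{K-1}\sum_{j \neq i} P(i,j)$ of its pairwise win probabilities, and the pairwise-subset choice model is taken to mean that the winner is produced by dueling a uniformly random pair among the $m$ selected arms. The names `borda_scores`, `borda_winner`, `sample_winner`, and the matrix `P_EXAMPLE` are hypothetical and used only for illustration.

```python
import random

# Hypothetical 3-arm preference matrix: P[i][j] is the probability that
# arm i is preferred over arm j, with P[i][j] + P[j][i] = 1 and P[i][i] = 0.5.
P_EXAMPLE = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]


def borda_scores(P):
    """Assumed standard Borda score: average win probability against the other K-1 arms."""
    K = len(P)
    return [sum(P[i][j] for j in range(K) if j != i) / (K - 1) for i in range(K)]


def borda_winner(P):
    """Index of an arm with the highest Borda score (first one on ties)."""
    scores = borda_scores(P)
    return max(range(len(P)), key=scores.__getitem__)


def sample_winner(P, selected, rng):
    """Assumed pairwise-subset feedback: duel a uniformly random pair of the selected arms.

    `selected` is the list of m >= 2 arms chosen by the learner (repeats allowed).
    """
    a, b = rng.sample(range(len(selected)), 2)  # two distinct positions
    i, j = selected[a], selected[b]
    return i if rng.random() < P[i][j] else j
```

For `P_EXAMPLE`, arm 0 has the highest Borda score under this definition, so a learner that is compared against the Borda-winner would be measured against arm 0 in that round. Calling `sample_winner(P_EXAMPLE, [0, 1, 2], random.Random(0))` returns one of the selected arms as the round's feedback.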

Paper Structure

This paper contains 19 sections, 21 theorems, 30 equations, and 3 algorithms.

Key Result

Proposition 1

The shifted Borda score $s_t(i)$ of any arm $i \in [K]$ is related to its Borda score $b_t(i)$ by the equation

Theorems & Definitions (35)

  • Definition 1: Borda Score
  • Definition 2: Regret
  • Definition 3: Shifted Borda Score
  • Definition 4: Shifted Borda Regret
  • Proposition 1
  • proof
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Lemma 1
  • ...and 25 more