Bandits with Abstention under Expert Advice

Stephen Pasteris, Alberto Rumi, Maximilian Thiessen, Shota Saito, Atsushi Miyauchi, Fabio Vitale, Mark Herbster

TL;DR

The paper proposes the CBA algorithm, which exploits the assumption that one action, corresponding to the learner's abstention from play, has no reward or loss on every trial. CBA is the first algorithm to achieve bounds on the expected cumulative reward for general confidence-rated predictors.

Abstract

We study the classic problem of prediction with expert advice under bandit feedback. Our model assumes that one action, corresponding to the learner's abstention from play, has no reward or loss on every trial. We propose the CBA algorithm, which exploits this assumption to obtain reward bounds that can significantly improve those of the classical Exp4 algorithm. We can view our problem as the aggregation of confidence-rated predictors when the learner has the option of abstention from play. Importantly, we are the first to achieve bounds on the expected cumulative reward for general confidence-rated predictors. In the special case of specialists we achieve a novel reward bound, significantly improving previous bounds of SpecialistExp (treating abstention as another action). As an example application, we discuss learning unions of balls in a finite metric space. In this contextual setting, we devise an efficient implementation of CBA, reducing the runtime from quadratic to almost linear in the number of contexts. Preliminary experiments show that CBA improves over existing bandit algorithms.
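
To make the Exp4 baseline mentioned above concrete, the following is a minimal sketch of a textbook Exp4-style update in which abstention is treated as just another action (index 0) whose reward is always zero. It is not the CBA algorithm, whose update rule is not given in this summary. The function names, the reward callback, the rescaling of rewards from [-1, 1] to losses in [0, 1], and the two toy experts are all assumptions made for illustration.

```python
import numpy as np


def exp4_step(weights, advice, reward_fn, eta, rng):
    """One trial of a textbook Exp4-style update (illustrative, not CBA).

    weights:   shape (E,), nonnegative expert weights.
    advice:    shape (E, K + 1); each row is a probability distribution
               over actions, with column 0 reserved for abstention.
    reward_fn: hypothetical callback returning a reward in [-1, 1] for
               the played (non-abstention) action.
    """
    # Mix the experts' advice into one distribution over actions: O(KE).
    probs = weights @ advice
    probs = probs / probs.sum()
    action = int(rng.choice(len(probs), p=probs))

    # Abstention has zero reward on every trial by the model's assumption.
    reward = 0.0 if action == 0 else reward_fn(action)

    # Rescale the reward to a loss in [0, 1] (an assumption of this sketch).
    loss = (1.0 - reward) / 2.0

    # Importance-weighted loss estimate: nonzero only for the played action.
    loss_est = np.zeros_like(probs)
    loss_est[action] = loss / probs[action]

    # Each expert is charged its expected estimated loss: O(KE).
    expert_loss = advice @ loss_est
    new_weights = weights * np.exp(-eta * expert_loss)
    return new_weights / new_weights.sum(), action, reward


def run_toy_problem(trials=200, eta=0.1, seed=0):
    """Two deterministic toy experts: one good, one bad (hypothetical data)."""
    rng = np.random.default_rng(seed)
    advice = np.array([
        [0.0, 1.0, 0.0],  # expert 0 always recommends action 1
        [0.0, 0.0, 1.0],  # expert 1 always recommends action 2
    ])
    rewards = {1: 1.0, 2: -1.0}  # action 1 is rewarded, action 2 is penalized
    weights = np.full(2, 0.5)
    for _ in range(trials):
        weights, _, _ = exp4_step(weights, advice, rewards.get, eta, rng)
    return weights


if __name__ == "__main__":
    print(run_toy_problem())
```

In this toy run the weight of the expert recommending the rewarded action can only grow relative to the other. Both matrix-vector products touch every entry of the advice matrix once, so a trial costs time linear in the number of experts times the number of actions.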

Paper Structure (19 sections, 7 theorems, 63 equations, 6 figures, 3 algorithms)


Key Result

Theorem 3.1

CBA takes parameters $\eta\in(0,1)$ and $\boldsymbol{w}_1\in\mathbb{R}_+^E$. For any $\boldsymbol{u}\in\mathcal{V}$, the expected cumulative reward of CBA is bounded below by [display equation missing], where the expectations are with respect to the randomization of CBA's strategy. The per-trial time complexity of CBA is in $\mathcal{O}(KE)$.

Figures (6)

  • Figure 1: Illustrative example of abstention, in which we cover the foreground and background classes with metric balls. We consider two clusters (blue and orange) as the foreground and one background class (white), using the shortest-path $d_\infty$ metric. With abstention, we can cover the two clusters with one ball each and abstain on the background, which requires no balls (Fig. \ref{fig:example1}). In contrast, treating the background as another class requires significantly more balls to cover it, as shown by the 10 gray balls in Fig. \ref{fig:example2}. When the number of balls needed grows this much, the bound involving the number of balls becomes correspondingly worse.
  • Figure 2: Number of mistakes over time. The four main settings are presented from left to right: the Stochastic Block Model, the Gaussian graph, the Cora graph, and the LastFM Asia graph. D1, D2, and D-INF denote the $p$-norm bases, LVC denotes the community detection basis, and INT denotes the interval basis. The baselines (EXP3 for each context, Contextual Bandit with similarity, and GABA-II) are labeled EXP3, CBSim, and GABA, respectively, and are drawn with dashed lines. All panels show 95% confidence intervals over 20 runs, computed as the standard error multiplied by the $z$-score 1.96.
  • Figure 3: Stochastic Block Model results. Dotted lines represent different baselines, while solid lines represent various results.
  • Figure 4: Gaussian graph results. Dotted lines represent different baselines, while solid lines represent various results.
  • Figure 5: Cora results. Dotted lines represent different baselines, while solid lines represent various results.
  • ...and 1 more figures
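
The confidence intervals described in the Figure 2 caption (standard error multiplied by the $z$-score 1.96) can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name and the sample values are hypothetical, and the use of the sample standard deviation (with $n - 1$ in the denominator) is an assumption, since the caption does not specify it.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation interval.

    The half-width is z times the standard error, where the standard
    error is the sample standard deviation divided by sqrt(n).
    """
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err


if __name__ == "__main__":
    # Hypothetical mistake counts at one time step across several runs.
    mistakes = [12, 15, 11, 14, 13]
    mean, half_width = mean_with_ci(mistakes)
    print(f"{mean:.2f} +/- {half_width:.2f}")
```

Applied at each time step across the 20 runs, this yields the band drawn around each curve.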

Theorems & Definitions (12)

  • Theorem 3.1
  • Corollary 5.1
  • proof
  • Proposition 5.2
  • Theorem 5.3
  • Lemma A.1
  • proof
  • proof: Proof of Theorem \ref{cbath}
  • Proposition C.1
  • proof
  • ...and 2 more