Bandit-Feedback Online Multiclass Classification: Variants and Tradeoffs

Yuval Filmus; Steve Hanneke; Idan Mehalel; Shay Moran

TL;DR

This paper analyzes online multiclass classification in the mistake bound model along three axes: information (bandit vs. full-information feedback), adaptivity (adaptive vs. oblivious adversaries), and randomness (randomized vs. deterministic learners). It proves that the price of bandit feedback for randomized learners facing adaptive adversaries is at most a linear factor in the number of labels, $\textsf{opt}_{bandit}^{adap}(\mathcal{H}) = O(\textsf{opt}_{full}^{rand}(\mathcal{H}) \cdot |\mathcal{Y}|)$, and that this bound is tight up to constants. It also shows nearly optimal gaps between randomized and deterministic learners, and between adaptive and oblivious adversaries, under bandit feedback. A central technical contribution is a tight bound for prediction with expert advice under bandit feedback in the $r$-realizable setting: $\textsf{opt}_{bandit}^{adap}(n,k,r) = \Theta\big(k(\log_k n + r)\big)$, with matching lower bounds for oblivious adversaries in certain regimes. The analysis combines a reduction to prediction with expert advice, a dual game obtained via the minimax theorem, a pattern-class framework, and a budgeted-expert potential that handles the $r$-realizable setting; these ideas extend to the agnostic setting using a doubling trick. Overall, the work clarifies how bandit feedback, adaptivity, and randomness shape worst-case performance in multiclass online learning, and it yields near-optimal bounds with implications for designing bandit-feedback algorithms.
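As a rough numerical illustration of the expert-advice rate above, the following worked instance plugs in arbitrarily chosen values $n = 1024$ experts, $k = 4$ labels, and $r = 3$ allowed mistakes for the best expert. These numbers are assumptions made only for this example and do not come from the paper, and the constants hidden by $\Theta$ are ignored.

```latex
\[
k\,(\log_k n + r) \;=\; 4\,(\log_4 1024 + 3) \;=\; 4\,(5 + 3) \;=\; 32 .
\]
```

In this instance the optimal mistake bound is therefore within a constant factor of 32. Doubling $r$ to 6 raises the expression to $4\,(5 + 6) = 44$, which shows the additive $k \cdot r$ contribution of the realizability budget.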

Abstract

Consider the domain of multiclass classification within the adversarial online setting. What is the price of relying on bandit feedback as opposed to full information? To what extent can an adaptive adversary amplify the loss compared to an oblivious one? To what extent can a randomized learner reduce the loss compared to a deterministic one? We study these questions in the mistake bound model and provide nearly tight answers. We demonstrate that the optimal mistake bound under bandit feedback is at most $O(k)$ times higher than the optimal mistake bound in the full information case, where $k$ represents the number of labels. This bound is tight and provides an answer to an open question previously posed and studied by Daniely and Helbertal ['13] and by Long ['17, '20], who focused on deterministic learners. Moreover, we present nearly optimal bounds of $\tilde{\Theta}(k)$ on the gap between randomized and deterministic learners, as well as between adaptive and oblivious adversaries in the bandit feedback setting. This stands in contrast to the full information scenario, where adaptive and oblivious adversaries are equivalent, and the gap in mistake bounds between randomized and deterministic learners is a constant multiplicative factor of $2$. In addition, our results imply that in some cases the optimal randomized mistake bound is approximately the square-root of its deterministic parallel. Previous results show that this is essentially the smallest it can get.
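To make the two feedback models concrete, here is a minimal Python sketch of the online protocol in the realizable setting. It uses a naive deterministic plurality-vote elimination learner, which is an assumption chosen for illustration and is not the paper's algorithm; the names `run_elimination_learner`, `hypotheses`, and `stream` are hypothetical. Under full information the true label is revealed after each round, while under bandit feedback the learner only finds out whether its prediction was correct.

```python
from typing import Dict, List, Sequence, Tuple

Hypothesis = Dict[int, int]  # maps an instance to a label


def run_elimination_learner(
    hypotheses: Sequence[Hypothesis],
    stream: Sequence[Tuple[int, int]],
    bandit: bool,
) -> int:
    """Count the mistakes of a naive elimination learner (illustrative only).

    Assumes realizability: some hypothesis labels every (x, y) in the stream correctly.
    """
    alive: List[Hypothesis] = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        # Plurality vote among surviving hypotheses; ties go to the smallest label.
        votes: Dict[int, int] = {}
        for h in alive:
            votes[h[x]] = votes.get(h[x], 0) + 1
        prediction = min(votes, key=lambda label: (-votes[label], label))
        if prediction == y:
            # A correct prediction reveals the true label in both feedback models.
            alive = [h for h in alive if h[x] == y]
        else:
            mistakes += 1
            if bandit:
                # Bandit feedback: we only learn that `prediction` is wrong.
                alive = [h for h in alive if h[x] != prediction]
            else:
                # Full information: the true label y is revealed.
                alive = [h for h in alive if h[x] == y]
    return mistakes


# Toy class: k = 4 constant hypotheses over 5 instances; the target is the constant 3.
k, rounds = 4, 5
hypotheses = [{t: label for t in range(rounds)} for label in range(k)]
stream = [(t, k - 1) for t in range(rounds)]

full_info_mistakes = run_elimination_learner(hypotheses, stream, bandit=False)
bandit_mistakes = run_elimination_learner(hypotheses, stream, bandit=True)
```

On this toy class the learner makes 1 mistake with full information and $k - 1 = 3$ mistakes with bandit feedback, since each bandit mistake rules out only one label. This is merely consistent with the flavor of the $O(k)$ gap stated in the abstract; it does not demonstrate the paper's bounds.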

Paper Structure (42 sections, 25 theorems, 84 equations, 2 figures, 2 tables)

Introduction
Main questions and results
Information
A generalization to the agnostic setting.
Adaptivity
Randomness
Bounds for prediction with expert advice
Technical overview
Proof idea of Item (1)
Proof idea of Item (2)
Related work
Information
Adaptivity
Randomness
Prediction with expert advice
...and 27 more sections

Key Result

Theorem 1.1

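For every concept class $\mathcal{H}$ it holds that $\textsf{opt}_{bandit}^{adap}(\mathcal{H}) = O(\textsf{opt}_{full}^{rand}(\mathcal{H}) \cdot |\mathcal{Y}|)$. Furthermore, for every natural $k \geq 2$ there exists a concept class $\mathcal{H}$ with $|\mathcal{Y}| = k$ such that this bound is tight up to a constant factor.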

Figures (2)

Figure 1: $\mathsf{BanditRandSOA}$ is an optimal randomized learner for online learning of pattern classes with bandit feedback, where the adversary is allowed to be adaptive. It is inspired by the $\mathsf{RandSOA}$ algorithm of Filmus et al. (2023), which is a randomized variant of Littlestone's (1988) well-known $\mathsf{SOA}$ algorithm.
Figure 2: The "doubling trick" algorithm $\mathsf{DT}$.
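
The figure itself is not reproduced here, so the following is only a generic sketch of the doubling-trick idea, not the paper's $\mathsf{DT}$ algorithm. It assumes a simple budgeted plurality-vote learner over experts with bandit feedback: the learner guesses a mistake budget for the best expert, drops experts that the feedback proves wrong more often than the budget allows, and doubles the guess and restarts when no expert survives. The names `doubling_trick_experts`, `advice`, `truth`, and `initial_budget` are hypothetical.

```python
from typing import Dict, List, Sequence


def doubling_trick_experts(
    advice: Sequence[Sequence[int]],  # advice[t][i]: label predicted by expert i at round t
    truth: Sequence[int],             # truth[t]: correct label, used only to produce feedback
    initial_budget: int = 1,
) -> int:
    """Count learner mistakes under bandit feedback (generic illustrative sketch)."""
    n = len(advice[0])
    budget = initial_budget
    charges: List[int] = [0] * n
    mistakes = 0
    for preds, y in zip(advice, truth):
        alive = [i for i in range(n) if charges[i] <= budget]
        if not alive:
            # The guessed budget was too small: double it and restart from scratch.
            budget *= 2
            charges = [0] * n
            alive = list(range(n))
        # Plurality vote among surviving experts; ties go to the smallest label.
        votes: Dict[int, int] = {}
        for i in alive:
            votes[preds[i]] = votes.get(preds[i], 0) + 1
        prediction = min(votes, key=lambda label: (-votes[label], label))
        correct = prediction == y  # the single bit of bandit feedback
        if not correct:
            mistakes += 1
        for i in range(n):
            # Charge an expert only when the feedback bit proves it wrong this round.
            if correct:
                provably_wrong = preds[i] != prediction
            else:
                provably_wrong = preds[i] == prediction
            if provably_wrong:
                charges[i] += 1
    return mistakes
```

The design choice to restart with a doubled budget means the learner never needs to know the best expert's mistake count in advance, at the cost of repeating some work after each restart.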

Theorems & Definitions (46)

Theorem 1.1: Full-information vs. Bandit-feedback
Theorem 1.2: Oblivious vs. Adaptive Adversaries
Remark 1.3: Concept classes vs. Pattern classes
Theorem 1.4: Randomized vs. Deterministic
Theorem 1.5
Remark 2.1: Strong vs. Weak Realizability
Lemma 3.1
proof
Proposition 3.2
proof
...and 36 more