The Real Price of Bandit Information in Multiclass Classification

Liad Erez; Alon Cohen; Tomer Koren; Yishay Mansour; Shay Moran

TL;DR

This work resolves the fundamental question of how bandit feedback affects the minimax regret of single-label multiclass classification with a finite hypothesis class $H$. It introduces a new FTRL-based algorithm that combines negative-entropy and log-barrier regularization. It also reduces bandit multiclass classification to a sparse contextual bandit problem. The algorithm attains a near-optimal regret of $\widetilde{O}(|H| + \sqrt{T})$ for bandit multiclass classification, and $\widetilde{O}(|\Pi| + \sqrt{sT})$ in the general $s$-sparse contextual bandit setting with a finite policy class $\Pi$. A matching lower bound (up to log factors) shows that the minimax regret is $\widetilde{\Theta}(\min\{|H| + \sqrt{T}, \sqrt{KT \log |H|}\})$ in all parameter regimes, which quantifies the price of bandit information. Consequently, when $|H| \lesssim \sqrt{T}$ the regret is $\widetilde{O}(\sqrt{T})$ with no polynomial dependence on the number of labels $K$, so bandit feedback costs little. Once $|H|$ exceeds roughly $\sqrt{KT \log |H|}$, the classical $\sqrt{KT}$-type dependence is unavoidable. Practically, the method gives improved guarantees when the number of hypotheses is small to moderate and the label set is large. It also sets a clear benchmark for future algorithmic and complexity analyses of bandit contextual classification.
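The TL;DR names two ingredients: FTRL with a negative-entropy plus log-barrier regularizer, and a reduction of bandit multiclass to sparse losses. The Python sketch below illustrates both in a generic form under our own assumptions. The first assumption is the regularizer $R(p) = \frac{1}{\eta}\sum_i p_i \log p_i - \frac{1}{\gamma}\sum_i \log p_i$, where `gamma` is a hypothetical log-barrier coefficient. The second is the helper names `ftrl_hybrid` and `shifted_loss_estimate`, which are also hypothetical. The third is our reading of "sparse" as "few nonzero loss entries". This is not the paper's algorithm, whose exact parameterization (e.g., the roles of $\eta$ and $\nu$) is not reproduced here.

```python
# Hypothetical illustration of FTRL with an entropy + log-barrier regularizer.
# Not the paper's algorithm; names and parameterization are assumptions.
import math
from typing import List


def _solve_coordinate(c: float, lam: float, eta: float, gamma: float) -> float:
    """Unique p > 0 with  c + (log p + 1)/eta - 1/(gamma p) + lam = 0.

    The left-hand side is strictly increasing in p (its derivative is
    1/(eta p) + 1/(gamma p^2) > 0), so geometric bisection finds the root.
    """
    def g(p: float) -> float:
        return c + (math.log(p) + 1.0) / eta - 1.0 / (gamma * p) + lam

    lo, hi = 1e-12, 1.0
    while g(lo) > 0:  # move the lower end down until g(lo) <= 0
        lo *= 1e-3
    while g(hi) < 0:  # move the upper end up until g(hi) >= 0
        hi *= 2.0
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def ftrl_hybrid(cum_loss: List[float], eta: float, gamma: float) -> List[float]:
    """argmin over the simplex of <p, cum_loss> + R(p), where
        R(p) = (1/eta) sum_i p_i log p_i - (1/gamma) sum_i log p_i
    combines a negative-entropy term with a log-barrier term.
    The KKT condition is solved per coordinate for a multiplier lam,
    and lam is found by bisection so that the weights sum to one.
    """
    def total(lam: float) -> float:
        return sum(_solve_coordinate(c, lam, eta, gamma) for c in cum_loss)

    lo, hi = -1.0, 1.0
    while total(lo) < 1.0:  # total() is decreasing in lam
        lo -= 2.0 * (hi - lo)
    while total(hi) > 1.0:
        hi += 2.0 * (hi - lo)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    p = [_solve_coordinate(c, 0.5 * (lo + hi), eta, gamma) for c in cum_loss]
    s = sum(p)
    return [x / s for x in p]  # renormalize away residual bisection error


def shifted_loss_estimate(pred: int, correct: bool, prob_pred: float, K: int) -> List[float]:
    """Importance-weighted estimate of the shifted loss l(a) = -1[a == y].

    Subtracting the constant 1 from the 0-1 loss 1[a != y] leaves regret
    unchanged but makes the loss vector 1-sparse. Under bandit feedback only
    whether pred == y is observed; the estimate is unbiased when `pred` was
    drawn with probability `prob_pred`, and it is itself 1-sparse.
    """
    est = [0.0] * K
    if correct:
        est[pred] = -1.0 / prob_pred
    return est
```

For example, `ftrl_hybrid([0.0, 50.0], eta=1.0, gamma=1.0)` keeps the second weight on the order of $10^{-2}$. Negative entropy alone would push it down to roughly $e^{-50}$. This shows how the log-barrier term keeps every weight bounded away from zero. As `gamma` grows, the barrier vanishes and the sketch reduces to ordinary exponential weights.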

Abstract

We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input is classified into one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary question concerns the dependence on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\sqrt{KT}$ dependence exhibited by existing algorithms. Our main contribution is showing that the minimax regret of bandit multiclass classification is in fact more nuanced. It is of the form $\widetilde{\Theta}\left(\min\left\{|H| + \sqrt{T}, \sqrt{KT \log |H|}\right\}\right)$, where $H$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\widetilde{O}(|H| + \sqrt{T})$, improving over classical algorithms for moderately sized hypothesis classes. We also give a matching lower bound establishing tightness of the upper bounds (up to log factors) in all parameter regimes.
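The Python sketch below gives a rough illustration of which term of the minimax rate is active. It evaluates both terms of $\min\{|H| + \sqrt{T}, \sqrt{KT \log |H|}\}$ and sets aside all constants and the logarithmic factors hidden by the tilde, so it is only an order-of-magnitude comparison. The function names and the example values of $T$, $K$ and $|H|$ are our own illustrative choices. EXP4 is mentioned only as one known algorithm that achieves the $\sqrt{KT \log |H|}$ rate.

```python
# Order-of-magnitude comparison of the two terms in the minimax rate.
# Constants and hidden log factors are ignored; names are illustrative.
import math
from typing import Tuple


def new_bound(num_hypotheses: int, T: int) -> float:
    """Order of the new upper bound |H| + sqrt(T)."""
    return num_hypotheses + math.sqrt(T)


def classical_bound(num_hypotheses: int, K: int, T: int) -> float:
    """Order of the classical bound sqrt(K T log|H|), e.g. attained by EXP4."""
    return math.sqrt(K * T * math.log(num_hypotheses))


def minimax_order(num_hypotheses: int, K: int, T: int) -> Tuple[float, str]:
    """Return min{|H| + sqrt(T), sqrt(K T log|H|)} and which term attains it."""
    a = new_bound(num_hypotheses, T)
    b = classical_bound(num_hypotheses, K, T)
    return (a, "new") if a <= b else (b, "classical")


if __name__ == "__main__":
    T, K = 10**6, 1000
    for H in (100, 10**6):
        value, regime = minimax_order(H, K, T)
        print(f"|H|={H:>8}: ~{value:,.0f} ({regime} term is smaller)")
```

With $T = 10^6$ and $K = 1000$, taking $|H| = 100$ gives about $1.1 \times 10^3$ for the new term versus about $6.8 \times 10^4$ for the classical one. Taking $|H| = 10^6$ flips the comparison, giving about $10^6$ versus $1.2 \times 10^5$.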

Paper Structure

This paper contains 25 sections, 7 theorems, 65 equations, and 1 algorithm.

Introduction
  Summary of our contributions
  Overview of main ideas and techniques
  Open problems and future work
  Additional related work
    Bandit multiclass classification.
    Contextual bandits.
    Log-barrier regularization.
Problem setup
  Bandit multiclass classification.
  Learning objective.
  Types of environment.
Main algorithm and upper bounds
  Reduction to Sparse Contextual Bandits
  Algorithm for Sparse Contextual Bandits
...and 10 more sections

Key Result

Theorem 1

Let $\Pi \subseteq \{\mathcal{X} \to \mathcal{A}\}$ be a finite policy class of size $N$, where $|\mathcal{A}| = K$, and let $T \geq 1$. Then for any $s$-sparse contextual bandit instance over $\Pi$, the expected regret of Algorithm 1 with $\eta = \sqrt{\log(N)/(sT)}$, $\nu = 1/16$, and …
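The calculation below suggests where the step size $\eta$ in Theorem 1 plausibly comes from. It assumes (our assumption, since the theorem statement above is truncated) that the analysis yields the familiar FTRL trade-off $\frac{\log N}{\eta} + \eta\, sT$ between a regularization term and a sparsity-dependent stability term. Under that assumption, the stated choice of $\eta$ balances the two terms:

```latex
\eta = \sqrt{\frac{\log N}{sT}}
\quad\Longrightarrow\quad
\frac{\log N}{\eta} = \sqrt{sT \log N},
\qquad
\eta\, sT = \sqrt{sT \log N},
\qquad
\frac{\log N}{\eta} + \eta\, sT = 2\sqrt{sT \log N}.
```

Under this assumption, the calculation accounts for the $\sqrt{sT}$ term (up to a logarithmic factor) of the $\widetilde{O}(|\Pi| + \sqrt{sT})$ bound quoted in the TL;DR. It does not explain the additive $|\Pi| = N$ term.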

Theorems & Definitions (15)

Theorem 1
Corollary 1
Proof (sketch)
Theorem 2
Proof of the bandit multiclass lower bound (sketch)
Lemma 1
Lemma 2
Proof of the sparse contextual bandit upper bound
Theorem 3
Proof
...and 5 more
