The Real Price of Bandit Information in Multiclass Classification
Liad Erez, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran
TL;DR
This work resolves the question of how bandit feedback affects the minimax regret of single-label multiclass classification over a finite hypothesis class $H$. It introduces a new FTRL-based algorithm that combines negative-entropy and log-barrier regularization, reduces bandit multiclass to a sparse contextual bandit problem, and attains near-optimal regret $\widetilde{O}(|H| + \sqrt{T})$ for bandit multiclass (and $\widetilde{O}(|\Pi| + \sqrt{sT})$ in the general sparse contextual setting, where $\Pi$ is the policy class and $s$ the sparsity level). A matching lower bound shows that the resulting minimax rate, $\widetilde{\Theta}(\min\{|H| + \sqrt{T}, \sqrt{KT \log |H|}\})$, is tight up to log factors in all parameter regimes; this rate quantifies the price of bandit information. The results show that for moderately sized hypothesis classes bandit feedback carries little penalty, while for larger classes the classic $\sqrt{KT}$-type dependence is unavoidable. Practically, the method gives improved regret guarantees when the number of hypotheses is small to moderate and the label set is large, and it sets a benchmark for future algorithmic and complexity analyses of bandit contextual classification.
Abstract
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|H| + \sqrt{T}, \sqrt{KT \log |H|} \right\} \right)}$, where $H$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|H|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.
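To get a feel for when each term of the minimax rate is the smaller one, the two expressions inside the minimum can be compared numerically. The sketch below is illustrative only: it drops all constants and logarithmic factors hidden by the tilde notation, and the function names (`bandit_multiclass_rate`, `active_regime`) and example parameter values are assumptions made here, not taken from the paper.

```python
import math


def bandit_multiclass_rate(num_hypotheses: int, num_labels: int, horizon: int) -> float:
    """Evaluate min{|H| + sqrt(T), sqrt(K T log|H|)}.

    Constants and the log factors hidden by the tilde notation are ignored,
    so the value only indicates the order of growth. Assumes |H| >= 2.
    """
    small_class_term = num_hypotheses + math.sqrt(horizon)
    classical_term = math.sqrt(num_labels * horizon * math.log(num_hypotheses))
    return min(small_class_term, classical_term)


def active_regime(num_hypotheses: int, num_labels: int, horizon: int) -> str:
    """Report which of the two terms attains the minimum."""
    small_class_term = num_hypotheses + math.sqrt(horizon)
    classical_term = math.sqrt(num_labels * horizon * math.log(num_hypotheses))
    return "small-class" if small_class_term <= classical_term else "classical"


if __name__ == "__main__":
    # Hypothetical parameter choices: (|H|, K, T).
    for h, k, t in [(100, 1000, 10**6), (10**6, 10, 10**4)]:
        rate = bandit_multiclass_rate(h, k, t)
        print(f"|H|={h}, K={k}, T={t}: rate ~ {rate:.1f} ({active_regime(h, k, t)})")
```

With few hypotheses and many labels, the $|H| + \sqrt{T}$ term is the smaller one and does not depend on $K$; with a very large hypothesis class, the $\sqrt{KT \log |H|}$ term takes over.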
