Online Learning with Sublinear Best-Action Queries
Matteo Russo, Andrea Celli, Riccardo Colini Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, Niek Tax
TL;DR
This work studies online learning with a budget of at most $k$ best-action queries, where an oracle reveals the identity of the best action at a given time step. It shows that in the full-feedback model, a Hedge-based approach with uniformly random queries achieves the minimax regret of $\Theta(\min\{\sqrt{T}, T/k\})$, and provides matching lower bounds, revealing a multiplicative gain from querying. In the label-efficient setting, updating only at queried times yields $\Theta(\min\{T/\sqrt{k}, T^2/k^2\})$ regret with tight lower bounds; in the stochastic i.i.d. setting, Follow-The-Leader and Explore-Then-Commit variants achieve $\Theta(T/k)$ or $\tilde{\Theta}(\sqrt{T})$ depending on $k$, with improvements over standard label-efficient rates. The results collectively show that even sublinear querying budgets can substantially improve regret, and they establish precise minimax rates across full and partial feedback, as well as in the stochastic regime, with potential practical impact for budget-constrained prediction tasks.
Abstract
In online learning, a decision maker repeatedly selects one of a set of actions, with the goal of minimizing the overall loss incurred. Following the recent line of research on algorithms endowed with additional predictive features, we revisit this problem by allowing the decision maker to acquire additional information on the actions to be selected. In particular, we study the power of \emph{best-action queries}, which reveal beforehand the identity of the best action at a given time step. In practice, predictive features may be expensive, so we allow the decision maker to issue at most $k$ such queries. We establish tight bounds on the performance any algorithm can achieve when given access to $k$ best-action queries for different types of feedback models. In particular, we prove that in the full-feedback model, $k$ queries are enough to achieve an optimal regret of $\Theta\left(\min\left\{\sqrt{T}, \frac{T}{k}\right\}\right)$. This finding highlights the significant multiplicative advantage in the regret rate achievable with even a modest (sublinear) number $k \in \Omega(\sqrt{T})$ of queries. Additionally, we study the challenging setting in which the only available feedback is obtained during the time steps corresponding to the $k$ best-action queries. There, we provide a tight regret rate of $\Theta\left(\min\left\{\frac{T}{\sqrt{k}},\frac{T^2}{k^2}\right\}\right)$, which improves over the standard $\Theta\left(\frac{T}{\sqrt{k}}\right)$ regret rate for label-efficient prediction for $k \in \Omega(T^{2/3})$.
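As a rough illustration of the full-feedback setting described above, the following Python sketch runs Hedge over a fixed loss sequence and spends $k$ best-action queries at uniformly random time steps. It is a minimal sketch under stated assumptions, not the paper's algorithm as specified: the function name `hedge_with_queries`, the loss representation, and the learning rate (the textbook Hedge tuning $\sqrt{8 \ln n / T}$) are all hypothetical choices made for this example.

```python
import math
import random


def hedge_with_queries(losses, k, eta=None, seed=0):
    """Hedge with k best-action queries at uniformly random steps.

    losses: list of T rows, each a list of n losses in [0, 1].
    Returns (loss of the learner, loss of the best fixed action).
    All names and the tuning of eta are assumptions for illustration.
    """
    rng = random.Random(seed)
    T = len(losses)
    n = len(losses[0])
    if eta is None:
        # Textbook Hedge learning rate; not taken from the paper.
        eta = math.sqrt(8 * math.log(n) / T)

    # Choose the query steps uniformly at random, without replacement.
    query_steps = set(rng.sample(range(T), min(k, T)))

    cumulative = [0.0] * n  # cumulative loss of each action so far
    total = 0.0
    for t, loss_t in enumerate(losses):
        if t in query_steps:
            # The oracle reveals the best action of this step.
            action = min(range(n), key=loss_t.__getitem__)
        else:
            # Sample from the exponential-weights distribution.
            shift = min(cumulative)  # subtract the minimum for stability
            weights = [math.exp(-eta * (c - shift)) for c in cumulative]
            action = rng.choices(range(n), weights=weights)[0]
        total += loss_t[action]
        # Full feedback: every action's loss is observed at every step.
        for i in range(n):
            cumulative[i] += loss_t[i]
    return total, min(cumulative)
```

The regret of a run is the first returned value minus the second. Because a queried step always incurs the smallest loss of that step, this difference can be negative when $k$ is large.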
