Bandit and Delayed Feedback in Online Structured Prediction
Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue, Kenji Yamanishi
TL;DR
This work studies online structured prediction (OSP) under weaker feedback than the full-information setting: bandit feedback and delayed feedback. It builds on the structured encoding loss function (SELF) and Fenchel–Young loss frameworks, which relate target and surrogate losses and enable efficient online optimization with guarantees. The authors present two bandit algorithms: the first uses an inverse-weighted gradient estimator and achieves $O(\sqrt{KT})$ surrogate regret, where $T$ is the time horizon and $K$ is the size of the output set; the second uses a pseudo-inverse matrix estimator and achieves $O(T^{2/3})$ surrogate regret, independent of $K$, for SELF* targets. They also extend the analysis to delayed feedback, covering full-information and bandit feedback with fixed and variable delays. In the delayed-bandit setting, they show $O(\sqrt{(K+D)T})$ and $O(D^{1/3}T^{2/3})$ surrogate regret bounds, where $D$ is the delay parameter. The paper also provides complexity analyses and experiments indicating practical gains when $K$ is large. Overall, it broadens the applicability of OSP with bandit- and delay-tolerant algorithms that have provable surrogate-regret guarantees, and it shows how SELF structure can reduce dependence on the output-space size.
Abstract
Online structured prediction is the task of sequentially predicting outputs with complex structures based on inputs and past observations; it encompasses online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback: bandit and delayed feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms with that of existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.
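
To make the inverse-weighting idea mentioned above concrete, here is a minimal sketch for the simplest special case, multiclass classification with 0-1 bandit feedback. It is a generic illustration of inverse-probability weighting, not the paper's algorithm; the function names `ipw_onehot_estimate` and `expected_estimate` and the example distribution are hypothetical. The learner samples a prediction from a distribution `probs`, observes only whether the prediction was correct, and forms an estimate of the one-hot label vector by dividing by the sampling probability. Averaging over the learner's own randomness recovers the true one-hot vector exactly.

```python
from fractions import Fraction


def ipw_onehot_estimate(sampled, true_label, probs):
    """Inverse-probability-weighted estimate of the one-hot label vector.

    The true label enters only through the comparison `sampled == true_label`,
    which is the single bit a learner would observe under 0-1 bandit feedback.
    """
    est = [Fraction(0)] * len(probs)
    if sampled == true_label:
        # Reweight the observed hit by the inverse of its sampling probability.
        est[sampled] = 1 / probs[sampled]
    return est


def expected_estimate(true_label, probs):
    """Exact expectation of the estimate over the sampled prediction."""
    k = len(probs)
    mean = [Fraction(0)] * k
    for i in range(k):
        est = ipw_onehot_estimate(i, true_label, probs)
        for j in range(k):
            mean[j] += probs[i] * est[j]
    return mean


# Hypothetical sampling distribution over K = 3 outputs (illustrative values).
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
mean = expected_estimate(2, probs)
print(mean == [0, 0, 1])
```

In this sketch the only nonzero entry of the estimate is `1 / probs[true_label]`, so its second moment is also `1 / probs[true_label]`. Under near-uniform exploration over $K$ outputs this is on the order of $K$, which is one common intuition for why bounds built on such estimators tend to depend on $K$.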
