Bandit and Delayed Feedback in Online Structured Prediction

Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue, Kenji Yamanishi

TL;DR

This work addresses online structured prediction (OSP) under feedback weaker than the full-information setting: bandit and delayed feedback. It builds on the structured encoding loss function (SELF) and Fenchel–Young loss frameworks to relate target and surrogate losses, enabling efficient online optimization with guarantees. The authors present two bandit algorithms. The first uses an inverse-weighted gradient estimator and achieves $O\left(\sqrt{KT}\right)$ surrogate regret, where $T$ is the time horizon and $K$ is the size of the output set. The second uses a pseudo-inverse matrix estimator and achieves an $O\left(T^{2/3}\right)$ bound that is independent of $K$ for SELF* targets. They also extend the analysis to delayed feedback, obtaining tight bounds in both full-information and bandit settings, with fixed and variable delays. In the delayed-bandit setting, they show $O\big(\sqrt{(K+D)T}\big)$ and $O\big(D^{1/3}T^{2/3}\big)$ regret bounds in terms of the delay $D$, and they provide complexity analyses and experimental validation demonstrating practical gains when $K$ is large. Overall, the paper broadens the applicability of OSP by offering bandit- and delay-tolerant algorithms with provable surrogate-regret guarantees, and it shows how to leverage SELF structure to reduce dependence on the output-space size in challenging structured prediction tasks.
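To make the trade-off between the two bandit rates concrete, the following sketch compares $\sqrt{KT}$ with $T^{2/3}$ numerically. It assumes all constants hidden by the $O(\cdot)$ notation equal 1, which the source does not state, so the numbers only illustrate the scaling and not the actual bounds. Under that assumption, the two rates coincide at $K = T^{1/3}$. The function names `bandit_bounds` and `crossover_K` are hypothetical helpers introduced here.

```python
import math


def bandit_bounds(K: int, T: int) -> tuple[float, float]:
    """Return (sqrt(K*T), T**(2/3)), with hidden constants set to 1 (an assumption)."""
    return math.sqrt(K * T), T ** (2.0 / 3.0)


def crossover_K(T: int) -> float:
    """Value of K at which sqrt(K*T) equals T**(2/3), i.e. K = T**(1/3)."""
    return T ** (1.0 / 3.0)


if __name__ == "__main__":
    T = 1_000_000
    print(f"crossover at K ~ {crossover_K(T):.1f}")
    for K in (10, 1_000, 100_000):
        sqrt_kt, t23 = bandit_bounds(K, T)
        smaller = "sqrt(KT)" if sqrt_kt < t23 else "T^(2/3)"
        print(f"K={K:7d}  sqrt(KT)={sqrt_kt:12.1f}  T^(2/3)={t23:12.1f}  smaller: {smaller}")
```

In this idealized comparison, the $K$-independent rate becomes the smaller one once $K$ exceeds roughly $T^{1/3}$, which matches the summary's remark that the second algorithm is attractive when $K$ is large.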

Abstract

Online structured prediction is the task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback: bandit and delayed feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms, as well as existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.
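The idea behind an inverse-weighted gradient estimator can be illustrated for multiclass classification with bandit feedback. This is a minimal sketch of the generic technique and not the paper's algorithm. It assumes the logistic loss as the surrogate, a sampling distribution that mixes the softmax with uniform exploration at a hypothetical rate `q`, and feedback that reveals only whether the sampled class was correct. All function names are introduced here for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def full_info_grad(theta, y):
    """Gradient of the logistic loss -log softmax(theta)[y] with respect to theta."""
    g = softmax(theta)
    g[y] -= 1.0
    return g


def iw_grad_estimate(theta, p, y_hat, correct):
    """Inverse-weighted estimate that uses only the bandit signal `correct`.

    If the sampled class y_hat was correct, the true label is known to be y_hat,
    so the gradient can be computed and reweighted by 1 / p[y_hat]. Otherwise
    the estimate is zero.
    """
    if not correct:
        return np.zeros_like(theta)
    return full_info_grad(theta, y_hat) / p[y_hat]


def expected_estimate(theta, p, y):
    """Exact expectation of the estimate over y_hat ~ p, for checking unbiasedness."""
    return sum(p[k] * iw_grad_estimate(theta, p, k, k == y) for k in range(len(p)))


rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)
q = 0.1                                  # hypothetical exploration rate
p = (1 - q) * softmax(theta) + q / K     # every class has probability at least q / K
y = 2                                    # true label, hidden from the learner

# The estimator is unbiased: its expectation equals the full-information gradient.
assert np.allclose(expected_estimate(theta, p, y), full_info_grad(theta, y))
```

The estimate is nonzero only when the sampled class is correct, and its magnitude scales with $1/p(\hat{y})$. Mixing in uniform exploration keeps every probability at least $q/K$, which bounds that factor by $K/q$ and indicates one way a dependence on $K$ can enter the analysis of such estimators.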

Paper Structure

This paper contains 86 sections, 30 theorems, 93 equations, 5 figures, 2 tables, 4 algorithms.

Key Result

Proposition 2.2

Let $\Psi\colon \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ be a differentiable, Legendre-type function. Footnote: A function $\Psi$ is called Legendre-type if, for any sequence $x_1, x_2, \ldots$ in $\operatorname{int}(\operatorname{dom}(\Psi))$ that converges to a boundary point of …

Figures (5)

  • Figure 1: Results of the synthetic experiments in multiclass classification with bandit feedback. In all figures, the horizontal axis represents the number of classes $K$, and the vertical axis represents the cumulative target loss.
  • Figure 6: A box plot of error rates of the MNIST experiment for multiclass classification with bandit feedback.
  • Figure 7: Results of the synthetic experiments in multilabel classification with bandit feedback. The horizontal axis shows the number of labels, and the vertical axis indicates the cumulative target loss.
  • Figure (unnumbered): Randomized decoding $\phi_\Omega$
  • Figure (unnumbered): Randomized decoding with uniform exploration (RDUE) $\psi_\Omega$

Theorems & Definitions (51)

  • Definition 1: Blondel et al. (JMLR, 2020)
  • Proposition 2.2: Blondel et al. (JMLR, 2020) and Sakaue et al. (PMLR v247, 2024)
  • Lemma 2.3: Sakaue et al. (PMLR v247, 2024)
  • Remark 1
  • Lemma 3.1
  • proof
  • Lemma 3.3: e.g., Orabona (2023), A Modern Introduction to Online Learning
  • Remark 2
  • Theorem 3.4
  • proof: Proof of \ref{thm:bandit_regret_expectation_abstract}
  • ...and 41 more