Non-Stationary Online Structured Prediction with Surrogate Losses
Shinsaku Sakaue, Han Bao, Yuzhou Cao
TL;DR
The paper tackles non-stationary online structured prediction with surrogate losses, where finite surrogate regret bounds against a fixed estimator lose their force because every fixed estimator may incur a surrogate loss that grows linearly with $T$. It derives a bound on the cumulative target loss of the form $\sum_{t=1}^T \ell(\hat{y}_t, y_t) \le F_T + C(1 + P_T)$, where $F_T$ is the cumulative surrogate loss of any comparator sequence, $P_T$ is its path length, and $C > 0$ is a constant. The analysis combines the dynamic regret bound of online gradient descent (OGD) with the technique of exploiting the surrogate gap. It also motivates a Polyak-style learning rate for OGD that systematically offers target-loss guarantees and shows promising empirical performance. The framework is extended to the broader class of convolutional Fenchel–Young losses, covering nontrivial targets such as ranking and NDCG. A lower bound shows that the dependence on $F_T$ and $P_T$ is tight in the worst case, so the results together give target-loss guarantees for non-stationary, full-information online structured prediction.
Abstract
Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret -- the cumulative excess of the target loss (e.g., 0-1 loss) over the surrogate loss (e.g., logistic loss) of the best fixed estimator -- has gained attention, particularly because it often admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur a surrogate loss that grows linearly with $T$. We address this by proving a bound of the form $F_T + C(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence, $P_T$ is its path length, and $C > 0$ is a constant. This bound depends on $T$ only through $F_T$ and $P_T$, often yielding much stronger guarantees in non-stationary environments. Our core idea is to combine the dynamic regret bound of online gradient descent (OGD) with the technique of exploiting the surrogate gap. Our analysis also sheds light on a new Polyak-style learning rate for OGD, which systematically offers target-loss guarantees and exhibits promising empirical performance. We further extend our approach to a broader class of problems via the convolutional Fenchel--Young loss. Finally, we prove a lower bound showing that the dependence on $F_T$ and $P_T$ is tight.
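To make the quantities in the bound concrete, the following is a minimal sketch, not the paper's algorithm. It computes $F_T$ and $P_T$ for a comparator sequence on a toy two-class stream whose labeling rule flips halfway through, and runs plain OGD on the logistic surrogate. Several choices are assumptions made for illustration only: linear estimators, the Frobenius norm for the path length, a constant learning rate in place of the Polyak-style rate, and argmax decoding in place of the paper's decoding scheme. All function names are hypothetical.

```python
import numpy as np


def logistic_loss(W, x, y):
    """Multiclass logistic (softmax cross-entropy) loss of the linear scores W @ x."""
    scores = W @ x
    scores = scores - scores.max()  # shift for a numerically stable log-sum-exp
    return float(np.log(np.exp(scores).sum()) - scores[y])


def logistic_grad(W, x, y):
    """Gradient of logistic_loss with respect to W."""
    scores = W @ x
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    probs[y] -= 1.0
    return np.outer(probs, x)


def comparator_terms(comparators, xs, ys):
    """Return (F_T, P_T) for a comparator sequence U_1, ..., U_T.

    F_T: cumulative surrogate loss of the sequence.
    P_T: path length, here sum_t ||U_t - U_{t-1}||_F (norm choice is an assumption).
    """
    F_T = sum(logistic_loss(U, x, y) for U, x, y in zip(comparators, xs, ys))
    P_T = sum(
        np.linalg.norm(U_next - U_prev)
        for U_prev, U_next in zip(comparators[:-1], comparators[1:])
    )
    return float(F_T), float(P_T)


def run_ogd(xs, ys, num_classes, lr=0.5):
    """Plain OGD on the logistic surrogate; returns the cumulative 0-1 loss.

    Constant lr and argmax decoding are simplifications, not the paper's method.
    """
    W = np.zeros((num_classes, xs[0].shape[0]))
    mistakes = 0
    for x, y in zip(xs, ys):
        y_hat = int(np.argmax(W @ x))        # predict before seeing the label
        mistakes += int(y_hat != y)          # target (0-1) loss
        W = W - lr * logistic_grad(W, x, y)  # gradient step on the surrogate
    return mistakes


def make_switching_stream(T=20, scale=5.0):
    """Toy stream: one-hot inputs whose labeling rule flips at T // 2."""
    eye = np.eye(2)
    half = T // 2
    xs = [eye[t % 2] for t in range(T)]
    ys = [t % 2 if t < half else 1 - (t % 2) for t in range(T)]
    U_a, U_b = scale * eye, scale * eye[::-1]  # one comparator per regime
    comparators = [U_a if t < half else U_b for t in range(T)]
    return xs, ys, comparators
```

On this stream, the two-piece comparator sequence pays for a single switch in $P_T$ while keeping $F_T$ small. A fixed estimator would instead have to fit two contradictory labeling rules with one matrix. Calling `comparator_terms(*reversed_args)` is not needed; `xs, ys, comps = make_switching_stream()` followed by `comparator_terms(comps, xs, ys)` and `run_ogd(xs, ys, num_classes=2)` gives the three quantities to compare.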
