Non-stationary Delayed Online Convex Optimization: From Full-information to Bandit Setting

Yuanyu Wan, Chang Yao, Yitao Ma, Mingli Song, Lijun Zhang

TL;DR

The work addresses online convex optimization (OCO) with arbitrarily delayed feedback in non-stationary environments, measuring performance by dynamic regret in terms of the comparator path-length $P_T$. It introduces Mild-OGD for the full-information setting (delayed gradients) and Mild-BGD for the bandit setting (delayed loss values), both built on a two-level framework (experts and a meta-learner) and enhanced by a delay-aware Hedge mechanism and blocking updates. Mild-OGD achieves $O\big(\sqrt{\bar{d}T(P_T+1)}\big)$ under in-order delays and $O\big(\sqrt{dT(P_T+1)}\big)$ in the worst case, where $\bar{d}$ and $d$ denote the average and maximum delay; a matching lower bound shows that the worst-case result is optimal. Mild-BGD attains $O\big((\sqrt{n}T^{3/4}+(n\bar{d})^{1/3}T^{2/3})\sqrt{P_T+1}\big)$ under in-order delays and $O\big((\sqrt{n}T^{3/4}+\sqrt{dT})\sqrt{P_T+1}\big)$ in the worst case. Since $(n\bar{d})^{1/3}T^{2/3}\le\sqrt{n}T^{3/4}$ whenever $\bar{d}\le\sqrt{n}T^{1/4}$, and $\sqrt{dT}\le\sqrt{n}T^{3/4}$ whenever $d\le n\sqrt{T}$, the delay terms are dominated in these regimes, so Mild-BGD matches the best dynamic regret bound of existing non-delayed bandit algorithms even under relatively large delays. The analysis leverages $\delta$-smoothed gradient estimators, blocking updates to decouple delays from gradient variance, and a doubling trick to handle unknown delays adaptively.
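
To make the phrase "$\delta$-smoothed gradient estimator" concrete, here is a minimal sketch of the classical one-point estimator used in bandit convex optimization: query the loss once at a randomly perturbed point $\mathbf{x}+\delta\mathbf{u}$ and return $(n/\delta)\,f(\mathbf{x}+\delta\mathbf{u})\,\mathbf{u}$. The function names and the test loss are illustrative assumptions, and the exact estimator and blocking scheme used by Mild-BGD may differ in detail.

```python
import math
import random


def sample_unit_sphere(n, rng):
    """Draw a point uniformly from the unit sphere in R^n (normalised Gaussian)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def one_point_gradient_estimate(f, x, delta, rng):
    """One-point estimate of the gradient of the delta-smoothed version of f.

    Only a single loss value f(x + delta * u) is observed, which is all the
    bandit setting provides. (Hypothetical helper, not taken from the paper.)
    """
    n = len(x)
    u = sample_unit_sphere(n, rng)
    query = [xi + delta * ui for xi, ui in zip(x, u)]
    scale = (n / delta) * f(query)
    return [scale * ui for ui in u]


def average_estimate(f, x, delta, rng, num_samples):
    """Average many one-point estimates to expose the estimator's mean."""
    total = [0.0] * len(x)
    for _ in range(num_samples):
        g = one_point_gradient_estimate(f, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_samples for t in total]
```

For a linear loss the smoothed function coincides with the loss itself, so averaging many estimates should recover the true gradient. A single estimate has magnitude on the order of $n/\delta$, which is the variance issue that the blocking updates mentioned above are meant to keep separate from the delays.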

Abstract

Although online convex optimization (OCO) under arbitrary delays has received increasing attention recently, previous studies focus on stationary environments with the goal of minimizing static regret. In this paper, we investigate the delayed OCO in non-stationary environments, and choose dynamic regret with respect to any sequence of comparators as the performance metric. To this end, we first propose an algorithm called Mild-OGD for the full-information case, where delayed gradients are available. The basic idea is to maintain multiple experts in parallel, each performing a gradient descent step with different learning rates for every delayed gradient according to their arrival order, and utilize a meta-algorithm to track the best one based on their delayed performance. Despite the simplicity of this idea, our novel analysis shows that the dynamic regret of Mild-OGD can be automatically bounded by $O(\sqrt{\bar{d}T(P_T+1)})$ under the in-order assumption and $O(\sqrt{dT(P_T+1)})$ in the worst case, where $\bar{d}$ and $d$ denote the average and maximum delay respectively, $T$ is the time horizon, and $P_T$ is the path-length of comparators. Moreover, we demonstrate that the result in the worst case is optimal by deriving a matching lower bound. Finally, we develop a bandit variant of Mild-OGD for a more challenging case with only delayed loss values. Interestingly, we prove that under a relatively large amount of delay, our bandit algorithm even enjoys the best dynamic regret bound of existing non-delayed bandit algorithms.
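
The two-level idea described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' Mild-OGD: the learning-rate grid `etas`, the meta learning rate `alpha`, the Euclidean-ball domain, the use of linearised losses for the meta-learner, and the convention that feedback from round `t` arrives at the end of round `t + delays[t] - 1` are all choices made here for the example.

```python
import math


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centred at 0."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def two_level_delayed_ogd(grad_fns, delays, etas, alpha, dim, radius=1.0):
    """Experts run delayed gradient descent; a Hedge-style meta-learner mixes them.

    grad_fns[t](x) returns the gradient of the round-t loss at x, and
    delays[t] >= 1 is the (arbitrary) delay of that gradient.
    """
    experts = [[0.0] * dim for _ in etas]   # one iterate per learning rate
    cum_loss = [0.0] * len(etas)            # delayed linearised loss per expert
    pending = {}                            # arrival round -> [(gradient, snapshot)]
    decisions = []
    for t in range(len(grad_fns)):
        # Meta-learner: exponential weights on the losses received so far.
        best = min(cum_loss)
        weights = [math.exp(-alpha * (loss - best)) for loss in cum_loss]
        total = sum(weights)
        x = [sum(w * e[j] for w, e in zip(weights, experts)) / total
             for j in range(dim)]
        decisions.append(x)
        # The gradient is determined now but only revealed after its delay.
        g = grad_fns[t](x)
        snapshot = [list(e) for e in experts]
        pending.setdefault(t + delays[t] - 1, []).append((g, snapshot))
        # Process every gradient arriving at the end of this round, in order.
        for g_k, snap in pending.pop(t, []):
            for i, eta in enumerate(etas):
                # Score expert i by the decision it held when g_k was generated.
                cum_loss[i] += sum(a * b for a, b in zip(g_k, snap[i]))
                step = [xi - eta * gi for xi, gi in zip(experts[i], g_k)]
                experts[i] = project_ball(step, radius)
    return decisions
```

Each expert takes one projected gradient step per arriving gradient, so an expert with a larger learning rate reacts faster to drift while one with a smaller rate is more stable. The meta-learner only ever sees losses that have already arrived, which is the "delayed performance" the abstract refers to.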

Paper Structure

This paper contains 30 sections, 22 theorems, 140 equations, 5 algorithms.

Key Result

Theorem 1

Under Assumptions 1 and 2, for any comparator sequence $\mathbf{u}_1,\dots,\mathbf{u}_T\in\mathcal{K}$, Algorithm Ader-Expert with $\mathbf{g}_t^\eta=\nabla f_t(\mathbf{x}_t^\eta)$ ensures a dynamic regret bound that depends on the quantities $m_t=t-1-\sum_{i=1}^{t-1}|\mathcal{F}_i|$.
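
The quantity $m_t$ can be computed directly from a delay sequence. The sketch below assumes, since this extract does not define it, that $\mathcal{F}_i$ is the set of rounds whose feedback arrives at the end of round $i$, and that the feedback of round $k$ with delay $d_k\ge 1$ arrives at the end of round $k+d_k-1$. Under those assumptions $m_t$ counts the rounds before $t$ whose feedback is still outstanding. The function name is hypothetical.

```python
def unreceived_counts(delays):
    """Return [m_1, ..., m_T] with m_t = t - 1 - sum_{i < t} |F_i|.

    delays[k - 1] is the delay d_k >= 1 of round k (rounds are 1-indexed).
    Assumption: feedback of round k arrives at the end of round k + d_k - 1.
    """
    T = len(delays)
    arrivals = [0] * (T + 1)            # arrivals[i] = |F_i| for i = 1..T
    for k, d in enumerate(delays, start=1):
        i = k + d - 1
        if i <= T:                      # feedback arriving after round T is never used
            arrivals[i] += 1
    counts = []
    received = 0                        # running value of sum_{i < t} |F_i|
    for t in range(1, T + 1):
        counts.append(t - 1 - received)
        received += arrivals[t]
    return counts
```

With no delay (every $d_k=1$) all counts are zero, and larger delays increase the number of outstanding rounds at each step.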

Theorems & Definitions (31)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Lemma 1
  • Remark 3
  • Theorem 2
  • Remark 4
  • Theorem 3
  • Lemma 2
  • Theorem 4
  • ...and 21 more