Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality

Yu-Hu Yan, Peng Zhao, Zhi-Hua Zhou

TL;DR

This paper links accelerated offline convex optimization with online-to-batch (O2B) conversions by introducing optimistic O2B conversions that embed look-ahead information into the analysis. The approach yields accelerated convergence for convex smooth objectives, extends to strongly convex objectives with optimal rates, and develops universal variants that adapt to smooth and non-smooth settings while using only one gradient query per iteration. It clarifies the connection to Nesterov's Accelerated Gradient and Polyak's Heavy-Ball, and demonstrates competitive empirical performance on standard convex problems. The work broadens the online-learning lens on acceleration and offers practical, horizon-efficient algorithms for a broad class of convex problems.

Abstract

In this work, we study offline convex optimization with smooth objectives, where the classical Nesterov's Accelerated Gradient (NAG) method achieves the optimal accelerated convergence. Extensive research has aimed to understand NAG from various perspectives, and a recent line of work approaches this from the viewpoint of online learning and online-to-batch conversion, emphasizing the role of optimistic online algorithms for acceleration. We contribute to this perspective by proposing novel optimistic online-to-batch conversions that incorporate optimism theoretically into the analysis, thereby significantly simplifying the online algorithm design while preserving the optimal convergence rates. Specifically, we demonstrate the effectiveness of our conversions through the following results: (i) when combined with simple online gradient descent, our optimistic conversion achieves the optimal accelerated convergence; (ii) our conversion also applies to strongly convex objectives, and by leveraging both optimistic online-to-batch conversion and optimistic online algorithms, we achieve the optimal accelerated convergence rate for strongly convex and smooth objectives, for the first time through the lens of online-to-batch conversion; (iii) our optimistic conversion can achieve universality to smoothness (it applies to both smooth and non-smooth objectives without requiring knowledge of the smoothness coefficient) and remains as efficient as non-universal methods by using only one gradient query in each iteration. Finally, we highlight the effectiveness of our optimistic online-to-batch conversions through a precise correspondence with NAG.

Paper Structure

This paper contains 39 sections, 15 theorems, 90 equations, 3 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

If the objective function $f(\cdot)$ is convex, then we have where $\widetilde{\mathbf{x}}_t \triangleq \frac{1}{A_t} (\sum_{s=1}^{t-1} \alpha_s \mathbf{x}_s + \alpha_t \mathbf{x}_{t-1})$ and $\bar{\mathbf{x}}_t \triangleq \frac{1}{A_t} (\sum_{s=1}^{t-1} \alpha_s \mathbf{x}_s + \alpha_t \mathbf{x}_t)$.
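As a rough illustration of the two averages named in Theorem 1, the Python sketch below computes $\widetilde{\mathbf{x}}_t$ and $\bar{\mathbf{x}}_t$ from a stored sequence of iterates. It assumes $A_t = \sum_{s=1}^{t} \alpha_s$, which this summary does not state explicitly; the function name `averaged_iterates` and the array layout are hypothetical choices made for the example.

```python
import numpy as np

def averaged_iterates(xs, alphas, t):
    """Return (x_tilde_t, x_bar_t) for the averages defined in Theorem 1.

    Assumed layout (hypothetical): xs[s] holds x_s for s = 0..t, and
    alphas[s - 1] holds alpha_s for s = 1..t.
    Assumption: A_t is the running sum alpha_1 + ... + alpha_t.
    """
    xs = np.asarray(xs, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    A_t = alphas[:t].sum()
    alpha_t = alphas[t - 1]

    # Shared part of both averages: sum_{s=1}^{t-1} alpha_s * x_s.
    past = np.zeros_like(xs[0])
    for a, x in zip(alphas[: t - 1], xs[1:t]):
        past = past + a * x

    x_tilde = (past + alpha_t * xs[t - 1]) / A_t  # last weight placed on x_{t-1}
    x_bar = (past + alpha_t * xs[t]) / A_t        # last weight placed on x_t
    return x_tilde, x_bar
```

The two averages differ only in their final term, so under the stated assumption $\bar{\mathbf{x}}_t - \widetilde{\mathbf{x}}_t = \frac{\alpha_t}{A_t} (\mathbf{x}_t - \mathbf{x}_{t-1})$.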

Figures (3)

  • Figure 1: Comparison of the update between the optimistic and stabilized conversions, where $\mathbf{g}_t^{\textsf{O}} = \alpha_t \nabla f(\widetilde{\mathbf{x}}_t)$ and $\mathbf{g}_t^{\textsf{S}} = \alpha_t \nabla f(\bar{\mathbf{x}}_t)$ represent the losses faced by the optimistic and stabilized conversions, and $\mathbf{x} \xrightarrow{\mathbf{g}} \mathbf{y}$ denotes updating from $\mathbf{x}$ to $\mathbf{y}$ using the information $\mathbf{g}$. Compared with the stabilized conversion, ours can update with the information of the upcoming losses.
  • Figure 2: Comparison in the non-universal setting of the convergence curves and time complexity. Our method (Ours) is compared with classic non-universal methods NAG, GD, and UniXGrad on one squared loss task (Squared) and five $\ell_2$-regularized logistic regression tasks (a1a, mushrooms, splice, splice-scale and svmguide3). Ours achieves convergence similar to that of NAG and UniXGrad while being faster than GD.
  • Figure 3: Comparison in the universal setting of the convergence curves and time complexity. Our methods, Ours (1Grad) and Ours (2Grad), are compared with classic universal methods UniXGrad and JRGS'20 on one squared loss task (Squared) and five $\ell_2$-regularized logistic regression tasks (a1a, mushrooms, splice, splice-scale and svmguide3). Our methods achieve convergence behavior comparable to that of the other contenders. Ours (1Grad) is more efficient than Ours (2Grad) and UniXGrad.
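
The optimistic update contrasted in Figure 1, paired with plain online gradient descent as in result (i) of the abstract, might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm listing: the weights $\alpha_t = t$, the step size $\eta = 1/(4L)$, the unconstrained domain, the convention $A_t = \sum_{s=1}^{t} \alpha_s$, and the function name `optimistic_o2b_ogd` are all choices made for illustration.

```python
import numpy as np

def optimistic_o2b_ogd(grad, x0, L, T):
    """Optimistic online-to-batch conversion driven by online gradient descent.

    Assumptions (not taken from the source): alpha_t = t, eta = 1 / (4 L),
    no projection step, and A_t = alpha_1 + ... + alpha_t.
    """
    x = np.asarray(x0, dtype=float)   # online iterate x_{t-1}
    weighted_sum = np.zeros_like(x)   # sum_{s<t} alpha_s * x_s
    A = 0.0
    eta = 1.0 / (4.0 * L)
    for t in range(1, T + 1):
        alpha = float(t)
        A += alpha
        # Query point x_tilde_t: the newest weight is placed on x_{t-1}.
        x_tilde = (weighted_sum + alpha * x) / A
        # Loss faced by the optimistic conversion: g_t^O = alpha_t * grad f(x_tilde_t).
        g = alpha * grad(x_tilde)     # the only gradient query in this round
        # Move from x_{t-1} to x_t using g_t^O, i.e. the upcoming loss.
        x = x - eta * g
        weighted_sum = weighted_sum + alpha * x
    return weighted_sum / A           # weighted average x_bar_T

# Usage on a small quadratic f(x) = 0.5 * x^T diag(q) x - b^T x.
q = np.array([1.0, 10.0, 100.0])
b = np.ones(3)
x_bar = optimistic_o2b_ogd(lambda x: q * x - b, np.zeros(3), L=100.0, T=200)
```

Each round issues a single gradient query, at $\widetilde{\mathbf{x}}_t$, and the returned point is the weighted average $\bar{\mathbf{x}}_T$. For an $L$-smooth convex objective, a standard smoothness argument applied to this particular sketch gives $f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^\star) \le 4L \|\mathbf{x}_0 - \mathbf{x}^\star\|^2 / (T(T+1))$; that bound belongs to the sketch and is not a statement quoted from the paper.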

Theorems & Definitions (29)

  • Theorem 1
  • Corollary 1
  • Definition 1: Strong Convexity
  • Theorem 2
  • Theorem 3
  • Corollary 2
  • Theorem 4
  • Remark 1: Technical Comparison
  • Theorem 5
  • Remark 2: Boundedness Assumption
  • ...and 19 more