Accelerating Single-Pass SGD for Generalized Linear Prediction

Qian Chen, Shihong Ding, Cong Fang

TL;DR

This work proposes the first algorithm that successfully incorporates momentum into single-pass SGD via a novel data-dependent proximal method, achieving dual-momentum acceleration. It also demonstrates that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.

Abstract

We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax-optimal statistical error, and a higher-order model-misspecification error. The proof handles misspecification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
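To make the streaming setting concrete, the following is a minimal sketch of plain single-pass SGD with iterate averaging, the unaccelerated baseline that the abstract contrasts against. It is not the paper's algorithm: it uses no momentum and no proximal step. Logistic regression is assumed as one instance of generalized linear prediction, and the names `single_pass_sgd`, `make_stream`, the step size, and the synthetic data model are all illustrative assumptions.

```python
import numpy as np

def single_pass_sgd(stream, dim, step_size=0.1):
    """Plain single-pass SGD with iterate averaging (baseline sketch, no momentum)."""
    x = np.zeros(dim)
    avg = np.zeros(dim)
    for t, (a, b) in enumerate(stream, start=1):
        # Each sample (a, b) is seen exactly once, then discarded.
        # Gradient of the logistic loss log(1 + exp(-b * <a, x>)) at this sample.
        margin = b * a.dot(x)
        grad = -b * a / (1.0 + np.exp(margin))
        x = x - step_size * grad
        # Running average of the iterates x_1, ..., x_t.
        avg += (x - avg) / t
    return avg

def make_stream(n, x_star, rng):
    """Hypothetical data source: Gaussian features, labels in {-1, +1} from a logistic model."""
    for _ in range(n):
        a = rng.standard_normal(x_star.shape[0])
        p = 1.0 / (1.0 + np.exp(-a.dot(x_star)))
        b = 1.0 if rng.random() < p else -1.0
        yield a, b

rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0, 0.5])
x_hat = single_pass_sgd(make_stream(5000, x_star, rng), dim=3)
print(x_hat)
```

Because `stream` is consumed as a generator, the sample size equals the number of gradient updates, which is the constraint that makes acceleration in this setting nontrivial.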

Paper Structure (69 sections, 47 theorems, 222 equations, 3 algorithms)

Key Result

Theorem 1

Suppose Assumptions assumption:l-condition, assumption:regularity and assumption:fourth-moment hold. Let Algorithm alg:sada start from $\tilde{\mathbf{x}}_0$, and choose the hyperparameters as specified in eq:param-choice-inner and eq:param-choice-outer. Let [...] where $n=KT$ is the sample size and $c_0$ is a universal constant.

Theorems & Definitions (96)

  • Remark 1: Effect of Acceleration
  • Remark 2
  • Remark 3
  • Theorem 1
  • Corollary 1: Sample Complexity
  • Remark 4
  • Lemma 1
  • Corollary 2: Sample Complexity
  • Remark 5
  • Lemma 2: Verification of Assumption assumption:gradient-noise-I
  • ...and 86 more