Accelerating Single-Pass SGD for Generalized Linear Prediction

Qian Chen; Shihong Ding; Cong Fang

TL;DR

This work proposes the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. It also demonstrates that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.

Abstract

We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles misspecification via a fine-grained stationary analysis of the inner updates and localizes the statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
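To make the streaming setting concrete, the following is a minimal sketch of plain single-pass SGD with tail-averaging on a logistic model, one instance of generalized linear prediction. It is a baseline illustration only and is not the accelerated algorithm proposed in the paper. The function names, the Gaussian feature distribution, the step size, and the tail fraction are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic link function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_stream(w_true, n, rng):
    """Yield n fresh (features, label) pairs from a logistic model.

    Hypothetical data generator: standard Gaussian features, Bernoulli labels.
    Each pair is produced once and never stored, mimicking a stream.
    """
    d = len(w_true)
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        p = sigmoid(sum(wi * ai for wi, ai in zip(w_true, a)))
        b = 1.0 if rng.random() < p else 0.0
        yield a, b


def single_pass_sgd(stream, d, n, step_size=0.05, tail_fraction=0.5):
    """Plain SGD: one gradient step per fresh sample, then tail-averaging.

    Baseline sketch only (no momentum, no proximal outer loop).
    """
    x = [0.0] * d
    avg = [0.0] * d
    tail_start = int(n * (1.0 - tail_fraction))
    count = 0
    for t, (a, b) in enumerate(stream):
        # Gradient of the logistic loss at x for the single sample (a, b).
        residual = sigmoid(sum(xi * ai for xi, ai in zip(x, a))) - b
        x = [xi - step_size * residual * ai for xi, ai in zip(x, a)]
        if t >= tail_start:
            # Running mean of the iterates in the tail of the pass.
            count += 1
            avg = [m + (xi - m) / count for m, xi in zip(avg, x)]
    return avg


if __name__ == "__main__":
    rng = random.Random(0)
    w_true = [1.0, -2.0, 0.5]
    n = 20000
    estimate = single_pass_sgd(sample_stream(w_true, n, rng), len(w_true), n)
    print("estimate:", [round(v, 3) for v in estimate])
```

Each sample is consumed exactly once, so the number of gradient steps equals the sample size. The paper's question is whether momentum can improve on this kind of baseline in the same single-pass regime.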

Paper Structure (69 sections, 47 theorems, 222 equations, 3 algorithms)

Introduction
Review: Previous Results
Well-specified Linear Regression.
Variance Reduction for Generalized Linear Prediction.
Our Results and Implications
Notations.
Related Work
Stochastic Approximation.
Momentum Acceleration and Variance Reduction.
Problem Setup
Assumptions
Summary of Problem-Dependent Quantities
Stochastic Accelerated Data-Dependent Algorithm
Inner Loop: Accelerated Solver with Tail-Averaging
Outer Loop: Data-Dependent Proximal Method with Acceleration
...and 54 more sections

Key Result

Theorem 1

Suppose Assumptions assumption:l-condition, assumption:regularity, and assumption:fourth-moment hold. Let Algorithm alg:sada start from $\tilde{\mathbf{x}}_0$, and choose the hyperparameters as specified in eq:param-choice-inner and eq:param-choice-outer. [...] where $n=KT$ is the sample size and $c_0$ is a universal constant.

Theorems & Definitions (96)

Remark 1: Effect of Acceleration
Remark 2
Remark 3
Theorem 1
Corollary 1: Sample Complexity
Remark 4
Lemma 1
Corollary 2: Sample Complexity
Remark 5
Lemma 2: Verification of Assumption assumption:gradient-noise-I
...and 86 more
