Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods

Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang

TL;DR

This work advances SGD theory for non-convex problems by introducing a stopping-time martingale framework that yields asymptotic convergence under relaxed step-size rules. Specifically, it requires step sizes with $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^p < \infty$ for some $p>2$, in place of the classical Robbins–Monro conditions ($\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$), and it does not assume that the loss is globally Lipschitz. The main results show almost-sure convergence of the loss to a critical value and, under a local high-order moment bound, almost-sure and $L^2$ convergence of the gradient norm. The approach broadens practical applicability by weakening standard assumptions (e.g., removing global Lipschitz continuity of $f$ and global higher-moment bounds) and by using a stopping-time analysis to control the stochastic noise. Together, these findings offer a principled theoretical basis for using a wider range of step-size schedules in non-convex SGD, with direct implications for algorithm design and convergence guarantees in practice.
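As a worked illustration of how the relaxed condition enlarges the admissible schedules, consider the polynomial family $\epsilon_t = t^{-\alpha}$. This family and the values $\alpha = 0.4$, $p = 3$ are chosen here for illustration and are not taken from the paper; the computation uses only the standard $p$-series test.

```latex
\epsilon_t = t^{-\alpha}: \qquad
\sum_{t \ge 1} \epsilon_t = \infty \iff \alpha \le 1, \qquad
\sum_{t \ge 1} \epsilon_t^{p} = \sum_{t \ge 1} t^{-p\alpha} < \infty \iff \alpha > \tfrac{1}{p}.

\text{With } \alpha = 0.4,\ p = 3: \qquad
\sum_{t \ge 1} t^{-0.4} = \infty, \qquad
\sum_{t \ge 1} t^{-1.2} < \infty, \qquad
\sum_{t \ge 1} t^{-0.8} = \infty .
```

The last sum is $\sum_t \epsilon_t^2$, so this schedule violates the square-summability part of Robbins–Monro (which needs $\alpha \in (1/2, 1]$) while satisfying the relaxed condition with $p = 3$. Because $p$ may be any number greater than 2, every exponent $\alpha \in (0, 1]$ is admissible for a suitable $p$.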

Abstract

Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins-Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $\{\epsilon_t\}_{t \geq 1}$ satisfying $\sum_{t=1}^{+\infty} \epsilon_t = +\infty$ and $\sum_{t=1}^{+\infty} \epsilon_t^p < +\infty$ for some $p > 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.
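The following is a minimal sketch of SGD run with a step-size schedule that meets the relaxed condition but not Robbins–Monro. Everything in it is an illustrative assumption rather than the paper's setup: the toy loss $f(\theta) = \theta^2/2 + 2\cos\theta$ (non-convex, not globally Lipschitz, with a Lipschitz gradient), the additive Gaussian gradient noise, the schedule $\epsilon_t = 0.5\, t^{-0.4}$ (summable at power $p = 3$, not at power 2), and the names `grad_f`, `step_size`, and `run_sgd`. Whether this toy problem meets every assumption of the paper has not been checked here.

```python
import math
import random


def grad_f(theta):
    """Gradient of the toy loss f(theta) = theta**2 / 2 + 2 * cos(theta)."""
    return theta - 2.0 * math.sin(theta)


def step_size(t, scale=0.5, alpha=0.4):
    """Polynomial schedule eps_t = scale * t**(-alpha), defined for t >= 1.

    With alpha = 0.4 the sum of eps_t diverges and the sum of eps_t**3
    converges, while the sum of eps_t**2 diverges.
    """
    return scale * t ** (-alpha)


def run_sgd(theta_1=3.0, num_steps=50_000, noise_std=1.0, seed=0):
    """Run plain SGD with noisy gradients and return the final iterate."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta = theta_1
    for t in range(1, num_steps + 1):
        # Hypothetical noise model: exact gradient plus zero-mean Gaussian noise.
        noisy_grad = grad_f(theta) + rng.gauss(0.0, noise_std)
        theta -= step_size(t) * noisy_grad
    return theta


theta_final = run_sgd()
print("final iterate:", theta_final)
print("gradient magnitude at final iterate:", abs(grad_f(theta_final)))
```

On this toy loss the iterate is expected to settle near one of the two local minimizers, roughly $\theta \approx \pm 1.9$, with a gradient magnitude far below its starting value of about 2.7. A single simulated run only illustrates the behavior and proves nothing about convergence.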

Paper Structure

This paper contains 19 sections, 11 theorems, 161 equations, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\{ \theta_t \}_{t \ge 1} \subset \mathbb{R}^d$ be the sequence generated by Algorithm alg:sgd with the initial point $\theta_1$. Suppose the step sizes $\{\epsilon_t\}_{t \ge 1}$ satisfy the conditions in Setting assump:learning_rate, and Assumptions assump:loss_function and assump:stochastic_g

Theorems & Definitions (12)

  • Theorem 3.1
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Theorem 3.2
  • Proposition 2
  • Remark 1
  • Theorem 3.3
  • Lemma 1.1
  • ...and 2 more