A Generalized Version of Chung's Lemma and its Applications

Li Jiang, Xiao Li, Andre Milzarek, Junwen Qiu

TL;DR

This work introduces a generalized Chung's Lemma that accommodates a broad class of step-size rules by analyzing a recursion of the form $a_{k+1}\le(1-1/s(b_k))a_k+1/t(b_k)$, where the rate mapping $r: b\mapsto s(b)/t(b)$ is convex. Using this tool, the authors derive non-asymptotic convergence rates for stochastic gradient methods under the $(\theta,\mu)$-Polyak-Lojasiewicz (PL) condition for exponential, cosine, constant, and polynomial step sizes, covering both SGD and random reshuffling. Two key ingredients are a splitting technique and an extension lemma, which make it possible to handle non-polynomial schedules and recursions that hold only for part of the iterates. The resulting rates are explicit in the PL exponent $\theta$ and show, in particular, that exponential step sizes adapt to both the landscape and the gradient noise. The results unify and extend existing non-asymptotic analyses and give a systematic way to certify convergence rates under general step-size rules, with implications for step-size design and tuning in large-scale nonconvex learning problems.
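To make the recursion concrete, the following minimal Python sketch iterates it with equality, which is the worst case allowed by the inequality. The choices $s(b)=1/(\mu b)$, $t(b)=1/(C b^2)$, the step sizes $b_k = 2/(k+2)$, and the constants `mu`, `C`, `K` are illustrative assumptions mimicking a strongly convex SGD-type bound; they are not taken from the paper. With these choices the rate mapping is $r(b) = Cb/\mu$, which is linear and hence convex.

```python
def iterate_recursion(a0, b, s, t):
    """Iterate a_{k+1} = (1 - 1/s(b_k)) * a_k + 1/t(b_k), i.e. the bound with equality."""
    a = [a0]
    for bk in b:
        a.append((1.0 - 1.0 / s(bk)) * a[-1] + 1.0 / t(bk))
    return a

# Hypothetical problem constants (assumptions for illustration only).
mu, C, K = 1.0, 1.0, 1000

s = lambda x: 1.0 / (mu * x)      # s(x) >= 1 whenever mu * x <= 1
t = lambda x: 1.0 / (C * x ** 2)  # t(x) > 0 for x > 0
r = lambda x: s(x) / t(x)         # rate mapping, here r(x) = C * x / mu

b = [2.0 / (k + 2) for k in range(K)]  # polynomial step sizes, b_k in (0, 1]
a = iterate_recursion(1.0, b, s, t)

print(f"a_K = {a[-1]:.5f}, r(b_(K-1)) = {r(b[-1]):.5f}, ratio = {a[-1] / r(b[-1]):.3f}")
```

In this toy setting the final value $a_K$ stays within a small constant multiple of $r(b_{K-1})$, which is the kind of statement a Chung-type lemma is meant to certify.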

Abstract

Chung's Lemma is a classical tool for establishing asymptotic convergence rates of (stochastic) optimization methods under strong convexity-type assumptions and appropriate polynomial diminishing step sizes. In this work, we develop a generalized version of Chung's Lemma, which provides a simple non-asymptotic convergence framework for a more general family of step size rules. We demonstrate broad applicability of the proposed generalized lemma by deriving tight non-asymptotic convergence rates for a large variety of stochastic methods. In particular, we obtain partially new non-asymptotic complexity results for stochastic optimization methods, such as Stochastic Gradient Descent (SGD) and Random Reshuffling (RR), under a general $(\theta,\mu)$-Polyak-Lojasiewicz (PL) condition and for various step size strategies, including polynomial, constant, exponential, and cosine step size rules. Notably, as a by-product of our analysis, we observe that exponential step sizes exhibit superior adaptivity to both landscape geometry and gradient noise; specifically, they achieve optimal convergence rates without requiring exact knowledge of the underlying landscape or separate parameter selection strategies for noisy and noise-free regimes. Our results demonstrate that the developed variant of Chung's Lemma offers a versatile, systematic, and streamlined approach to establish non-asymptotic convergence rates under general step size rules.
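For readers unfamiliar with the four step size families named above, here is a small Python sketch of one common way to parameterize them over a horizon of `K` iterations. The exact formulas, the parameter names `alpha`, `p`, `gamma`, `beta`, and the function names are assumptions chosen for illustration; the paper's own definitions may differ in constants and indexing.

```python
import math

def polynomial_steps(alpha, K, p=1.0, gamma=1.0):
    # alpha_k = alpha / (k + gamma)^p, a diminishing polynomial schedule
    return [alpha / (k + gamma) ** p for k in range(K)]

def constant_steps(alpha, K):
    # alpha_k = alpha for all k (alpha may itself depend on K)
    return [alpha] * K

def exponential_steps(alpha, K, beta=1.0):
    # alpha_k = alpha * g^k with g = (beta / K)^(1 / K), so that g^K = beta / K
    g = (beta / K) ** (1.0 / K)
    return [alpha * g ** k for k in range(K)]

def cosine_steps(alpha, K):
    # alpha_k = alpha * (1 + cos(k * pi / K)) / 2, decaying from alpha towards 0
    return [alpha * (1.0 + math.cos(k * math.pi / K)) / 2.0 for k in range(K)]

K = 100
schedules = {
    "polynomial": polynomial_steps(0.1, K),
    "constant": constant_steps(0.1, K),
    "exponential": exponential_steps(0.1, K),
    "cosine": cosine_steps(0.1, K),
}
for name, steps in schedules.items():
    print(f"{name:12s} first = {steps[0]:.4f}, last = {steps[-1]:.6f}")
```

Note that the exponential and cosine rules depend on the total horizon `K`, whereas the polynomial rule does not.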

Paper Structure

This paper contains 26 sections, 21 theorems, 138 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Theorem 2.1

Let $\{a_k\}_k \subseteq \mathbb{R}$, $\{b_k\}_k \subseteq \mathbb{R}$ and functions $s,t: \mathbb{R} \to \mathbb{R}$ be given. Assume that $\{a_k\}_k$ satisfies the recursion $a_{k+1}\le(1-1/s(b_k))a_k+1/t(b_k)$ and that there is an interval $I$ such that $b_k \in I$ for all $k\in[K]$, and $s(x)\geqslant 1$ and $t(x)>0$ for all $x\in I$. Then we have

Figures (2)

  • Figure 1: Heatmap of the rates $\mathcal{w}_s$, $\mathcal{w}_r$ w.r.t. $p,\theta$. The blue lines depict the optimal choice of $p$ w.r.t. $\theta$; left: $p = \varrho_s = \frac{2\theta}{4\theta-1}$ (${\sf SGD }$), right: $p = \varrho_r = \frac{\theta}{3\theta-1}$ (${\sf RR }$).
  • Figure 2: Relation between $\log_2(y_K)$ and $\log_2(K)$ for different $\theta$ with polynomial and exponential step sizes. A steeper slope corresponds to a faster sublinear rate.
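
The optimal exponents quoted in the caption of Figure 1 can be tabulated directly. The short Python sketch below evaluates $\varrho_s = \frac{2\theta}{4\theta-1}$ and $\varrho_r = \frac{\theta}{3\theta-1}$ exactly using rational arithmetic; the function names and the sample values of $\theta$ are assumptions chosen for illustration.

```python
from fractions import Fraction

def optimal_p_sgd(theta):
    # rho_s = 2*theta / (4*theta - 1), as in the Figure 1 caption (SGD)
    return 2 * theta / (4 * theta - 1)

def optimal_p_rr(theta):
    # rho_r = theta / (3*theta - 1), as in the Figure 1 caption (RR)
    return theta / (3 * theta - 1)

# Sample PL exponents (illustrative choices, not a claim about the admissible range).
for theta in (Fraction(1, 2), Fraction(3, 4), Fraction(1)):
    print(f"theta = {theta}: p_SGD = {optimal_p_sgd(theta)}, p_RR = {optimal_p_rr(theta)}")
```

For $\theta = 1/2$ both formulas give $p = 1$; for $\theta = 1$ they give $p = 2/3$ for SGD and $p = 1/2$ for RR.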

Theorems & Definitions (25)

  • Theorem 2.1: Generalized Chung's Lemma
  • Example 2.2
  • Remark 2.3
  • Lemma 2.4: Non-asymptotic Chung's Lemma
  • Lemma 2.5: Extension Lemma
  • Corollary 2.6
  • Definition 3.3: ($\theta$,$\mu$)-PL Condition
  • Lemma 3.5: Descent-type Properties
  • Remark 3.6
  • Lemma 3.7
  • ...and 15 more