Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent

Xiang Li, Qiaomin Xie

TL;DR

The paper addresses the challenge of efficiently tuning SGD with a constant stepsize by exploiting the Markov-chain view of the iterates and detecting the transition from the transient to stationary phase. It introduces a coupling-based stationarity diagnostic that runs two coupled SGD sequences and monitors the ratio $\|D_k\|^2/\|D_0\|^2$ of their differences, triggering a stepsize reduction when the statistic indicates stationarity; the approach is theoretically justified in both quadratic and general convex settings via $W_2$ convergence bounds. Key contributions include a simple, practical diagnostic, an adaptive threshold variant, and extensive empirical validation across convex and non-convex problems (e.g., logistic regression, least squares, and ResNet-18 on CIFAR-10) showing superior performance and robustness compared to existing diagnostics. The method enables more efficient use of fixed-stepsize SGD, with potential impact on broad stochastic optimization tasks and reinforcement learning where data are costly or limited.

Abstract

The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel coupling-based convergence diagnostic procedure, which monitors the distance between two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown theoretically to track the transition from transience to stationarity. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.
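
The diagnostic described above can be illustrated with a minimal Python sketch on least squares regression. It is not the authors' algorithm: it assumes that "coupled" means both chains draw the same data point at every step, and the names `coupled_sgd_least_squares`, `make_problem`, `threshold`, and `decay`, as well as the restart rule (re-placing the second chain at a random unit offset from the first), are hypothetical choices made for illustration only.

```python
import numpy as np


def coupled_sgd_least_squares(X, y, gamma0, threshold=0.1, decay=0.5,
                              n_steps=5000, seed=0):
    """Constant-stepsize SGD with a coupling-style stationarity check.

    Two chains start from different points but use the SAME sampled data
    point at every step (the coupling assumed in this sketch). When the
    ratio ||D_k||^2 / ||D_0||^2 drops below `threshold`, the stepsize is
    multiplied by `decay` and the pair of chains is restarted.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta1 = np.zeros(d)

    def separate(theta):
        # Hypothetical restart rule: put the second chain at a random
        # unit-length offset from the first one.
        offset = rng.normal(size=d)
        return theta + offset / np.linalg.norm(offset)

    theta2 = separate(theta1)
    d0_sq = np.sum((theta1 - theta2) ** 2)
    gamma = gamma0
    restarts = []

    for k in range(n_steps):
        i = rng.integers(n)  # one index, shared by both chains
        x_i, y_i = X[i], y[i]
        theta1 = theta1 - gamma * (x_i @ theta1 - y_i) * x_i
        theta2 = theta2 - gamma * (x_i @ theta2 - y_i) * x_i

        # Diagnostic statistic: squared distance relative to its start.
        ratio = np.sum((theta1 - theta2) ** 2) / d0_sq
        if ratio < threshold:
            gamma *= decay            # reduce the stepsize
            restarts.append(k)        # record when the check fired
            theta2 = separate(theta1)
            d0_sq = np.sum((theta1 - theta2) ** 2)

    return theta1, gamma, restarts


def make_problem(n=200, d=5, noise=0.1, seed=1):
    """Synthetic least squares data with a known parameter vector."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    theta_star = np.ones(d)
    y = X @ theta_star + noise * rng.normal(size=n)
    return X, y, theta_star


if __name__ == "__main__":
    X, y, theta_star = make_problem()
    R_sq = np.max(np.sum(X ** 2, axis=1))  # largest squared feature norm
    theta, gamma, restarts = coupled_sgd_least_squares(
        X, y, gamma0=1.0 / (2.0 * R_sq))
    print("restart steps:", restarts)
    print("final stepsize:", gamma)
    print("distance to theta_star:", np.linalg.norm(theta - theta_star))
```

The statistic is normalized by the initial squared distance, so the threshold does not depend on how far apart the two chains are placed. The fixed threshold here is only in the spirit of the static variant mentioned in the TL;DR; the adaptive-threshold variant is not sketched.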

Paper Structure

This paper contains 24 sections, 5 theorems, 30 equations, 12 figures, 2 algorithms.

Key Result

Proposition 1

Suppose that the assumptions from L-smoothness through bounded variance hold. With constant stepsize $\gamma \in (0,2/L)$, the Markov chain $(\theta_k)_{k\geq 0}$ given by the SGD recursion satisfies: where $P^k_\gamma$ is the $k$-step Markov kernel for the chain $(\theta_k)_{k\geq 0}$, $W_2(\nu,\nu')$ is the Wasserstein distance of order two between measures $\nu,\nu'\in \mathcal{P}_2(\m

Figures (12)

  • Figure 1: Evolution of $\|\theta^{(1)}_k - \theta^\star\|^2$ and $\|\theta^{(2)}_k - \theta^\star\|^2$ and $\|\theta^{(1)}_k - \theta^{(2)}_k\|^2$ under least squares regression with two different constant stepsizes.
  • Figure 2: Effectiveness of our coupling-based statistic for stationarity diagnostic. Left: the algorithm with static threshold; Right: the algorithm with adaptive threshold. The vertical lines correspond to restarts of our coupling-based algorithms.
  • Figure 3: Logistic regression (left) and least squares regression (right). The initial stepsize of coupling/distance-based and $\text{ISGD}^{1/2}$ is $\gamma_0 = 4/R^2$ for logistic regression, and $\gamma_0 = 1/(2R^2)$ for least squares. The errors are averaged over $10$ replications.
  • Figure 4: Robustness results under logistic regression and least squares regression (LSR) with $d = 100$.
  • Figure 5: ResNet-18 Test Accuracy on CIFAR-10 dataset. The initial stepsizes are 0.01.
  • ...and 7 more figures
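
A short worked equation may help in reading the least squares setting of Figure 1. It assumes that the two chains are coupled by using the same sample $(x_k, y_k)$ at each step, which is an assumption of this sketch rather than something stated in the captions. Writing $D_k = \theta^{(1)}_k - \theta^{(2)}_k$ and using the update $\theta_{k+1} = \theta_k - \gamma (x_k^\top \theta_k - y_k) x_k$ for both chains, the label $y_k$ cancels in the difference:

$$
D_{k+1} = (I - \gamma x_k x_k^\top) D_k,
\qquad
\|D_{k+1}\|^2 = \|D_k\|^2 - \gamma \left(2 - \gamma \|x_k\|^2\right) (x_k^\top D_k)^2 .
$$

Under this assumption, $\|D_k\|^2$ is non-increasing whenever $\gamma \|x_k\|^2 \le 2$, and it is unaffected by the label noise that keeps $\|\theta^{(i)}_k - \theta^\star\|^2$ from decreasing further at stationarity.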

Theorems & Definitions (11)

  • Proposition 1: Proposition 2 in dieuleveut2020bridging
  • Proposition 2
  • Proposition 3: Quadratic
  • proof
  • Claim 1
  • Theorem 1: General Convex
  • proof
  • Lemma 1
  • proof
  • proof
  • ...and 1 more