On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

TL;DR

The paper analyzes stochastic gradient methods under arbitrary data ordering for non-convex smooth finite-sum objectives, introducing an adaptive correlation-time framework. By partitioning iterations into correlated blocks of size $\tau = \Theta(1/(L\gamma))$ and bounding a period-variance $\sigma^2_{\tau}$, it derives an $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$ convergence rate. This framework yields improved rates for Incremental Gradient and Single Shuffle over prior $n$-dependent bounds, and it recovers classical SGD rates in the appropriate limit. The results are complemented by experiments showing shuffle-based orders (SS and RR) often outperform plain SGD across quadratic, logistic, and neural network tasks, with stronger gains at smaller learning rates. Overall, the work provides a unified, order-aware theory that explains and quantifies when data ordering accelerates non-convex optimization in practice.

Abstract

In machine learning and neural network optimization, algorithms such as incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their theoretical optimization properties, especially for non-convex smooth functions, remain incompletely explored. This paper studies the convergence properties of SGD algorithms with arbitrary data ordering, within a broad framework for non-convex smooth functions. Our findings show improved convergence guarantees for incremental gradient and single shuffle SGD. In particular, if $n$ is the training set size, we improve the optimization term of the convergence guarantee by a factor of $n$: the cost of reaching accuracy $\varepsilon$ drops from $O(n / \varepsilon)$ to $O(1 / \varepsilon)$.

Paper Structure (39 sections, 4 theorems, 68 equations, 7 figures, 2 tables)

Key Result

Theorem 5.1

Let each of the functions $f_i$ be $L$-smooth. Let the stepsize $\gamma$ of the algorithm satisfy $\gamma \leq \frac{1}{8 \sqrt{3} L}$. Let $\tau = \bigl\lfloor\frac{1}{8 \sqrt{3} L \gamma} \bigr\rfloor$, and assume that $\sigma^2_{\tau}$ from its definition is bounded. Then the method converges at the rate $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$, where $F_0 = f(\mathbf{x}_0) - f^\star$.
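The quantities in the theorem can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function names `correlation_window` and `rate_bound` are hypothetical, and `rate_bound` evaluates the expression $\frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau}$ from the TL;DR with the hidden constant of the $\mathcal{O}(\cdot)$ assumed to be 1.

```python
import math


def correlation_window(L: float, gamma: float) -> int:
    """Return tau = floor(1 / (8 * sqrt(3) * L * gamma)).

    Raises ValueError if the step size violates gamma <= 1 / (8 * sqrt(3) * L).
    """
    if L <= 0 or gamma <= 0:
        raise ValueError("L and gamma must be positive")
    max_gamma = 1.0 / (8.0 * math.sqrt(3.0) * L)
    if gamma > max_gamma:
        raise ValueError(f"gamma={gamma} exceeds the allowed maximum {max_gamma:.6g}")
    return math.floor(1.0 / (8.0 * math.sqrt(3.0) * L * gamma))


def rate_bound(F0: float, L: float, gamma: float, T: int, sigma2_tau: float) -> float:
    """Evaluate F0 / (gamma * T) + L^2 * gamma^2 * sigma2_tau.

    Assumption: the constant hidden by the O-notation is taken to be 1,
    so this is only useful for comparing settings, not as a guarantee.
    """
    return F0 / (gamma * T) + (L ** 2) * (gamma ** 2) * sigma2_tau


if __name__ == "__main__":
    L, gamma = 1.0, 0.001
    tau = correlation_window(L, gamma)
    print("tau =", tau)
    print("bound =", rate_bound(F0=1.0, L=L, gamma=gamma, T=10_000, sigma2_tau=1.0))
```

Note that $\tau$ grows as the step size shrinks, so smaller learning rates correspond to longer correlated blocks.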

Figures (7)

  • Figure 1: Minimizing stochastic quadratic function for different strategies of sampling the gradients. Random Reshuffling (RR) and Single Shuffling (SS) work better than SGD.
  • Figure 2: Convergence curves for logistic regression on the Australian dataset from LIBSVM (Chang and Lin, 2011). Random Reshuffling (RR) and Single Shuffling (SS) are faster than SGD across varying learning rates.
  • Figure 3: Training the neural network model on a subset of the MNIST dataset of size $1000$. Random Reshuffling (RR) and Single Shuffling (SS) are better than SGD.
  • Figure 4: Training the neural network model on the CIFAR dataset. Single Shuffle (SS) shows the best performance.
  • Figure 5: Variance estimation for the Logistic Regression model on the w1a dataset. Our variance parameter $\sigma_{\tau}^2$ is significantly better than its corresponding upper bound $n \sigma_{\text{SGD}}^2$ used in classical SGD as well as in some prior analyses of Shuffle SGD strategies (Mishchenko et al., 2020). When $\tau$ is smaller than $\frac{n}{2}$, our variance parameter $\sigma_{\tau}^2$ is also better than the $\sigma^2_{\text{EPOCH}}$ used to analyse Shuffle SGD in (Mohtashami et al., 2022; Lu et al., 2022).
  • ...and 2 more figures
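A comparison in the spirit of Figure 1 can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the paper's experimental setup: the objective is the hypothetical scalar finite sum $f(x) = \frac{1}{n}\sum_i \frac{1}{2}(x - b_i)^2$, the function name `run` and all parameter values are made up for illustration, and the squared full-gradient norm is recorded at the end of each epoch.

```python
import random
from typing import List


def run(order_name: str, b: List[float], gamma: float, epochs: int, seed: int = 0) -> List[float]:
    """Run one-sample gradient steps on f(x) = mean_i 0.5 * (x - b_i)^2.

    order_name: "sgd" (with replacement), "ss" (single shuffle), "rr" (random reshuffling).
    Returns the squared full-gradient norm (x - mean(b))^2 after each epoch.
    """
    rng = random.Random(seed)
    n = len(b)
    b_mean = sum(b) / n
    x = 0.0
    fixed_perm = list(range(n))
    rng.shuffle(fixed_perm)  # drawn once; reused every epoch by "ss"
    history = []
    for _ in range(epochs):
        if order_name == "sgd":
            idx = [rng.randrange(n) for _ in range(n)]
        elif order_name == "ss":
            idx = fixed_perm
        elif order_name == "rr":
            idx = rng.sample(range(n), n)  # fresh permutation each epoch
        else:
            raise ValueError(f"unknown order: {order_name}")
        for i in idx:
            x -= gamma * (x - b[i])  # gradient of 0.5 * (x - b_i)^2
        history.append((x - b_mean) ** 2)
    return history


if __name__ == "__main__":
    data = [float(i) for i in range(1, 11)]
    for name in ("sgd", "ss", "rr"):
        print(name, run(name, data, gamma=0.05, epochs=20)[-1])
```

Results on such a toy problem depend on the seed and the step size, so a single run should not be read as confirming or refuting the figures.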

Theorems & Definitions (14)

  • Example 3.1: SGD
  • Example 3.2: Incremental Gradient (IG)
  • Example 3.3: Single Shuffle (SS)
  • Example 3.4: Random Reshuffling (RR)
  • Example 3.5: Single Function
  • Definition 4.2: Sequence correlation
  • Example 4.3: Example when the assumption on $\sigma^2_{\text{SGD}}$ fails (Mohtashami et al., 2022)
  • Example 4.4
  • Example 4.5: Example when the assumption on $\sigma^2_{\text{EPOCH}}$ fails
  • Theorem 5.1
  • ...and 4 more
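The orderings named in Examples 3.1 to 3.5 can be written as index sequences. The sketch below uses the standard definitions of these strategies as an assumption, since the page does not reproduce the paper's formal statements; the function name `index_order` and the strategy strings are hypothetical.

```python
import random
from typing import List


def index_order(strategy: str, n: int, T: int, seed: int = 0) -> List[int]:
    """Return the data indices i_0, ..., i_{T-1} visited under a given strategy.

    Assumed (standard) definitions:
      "sgd":    each index drawn uniformly at random with replacement
      "ig":     fixed cyclic order 0, 1, ..., n-1, 0, 1, ...
      "ss":     one random permutation, repeated every epoch
      "rr":     a fresh random permutation every epoch
      "single": the same function at every step (index 0 here)
    """
    rng = random.Random(seed)
    if strategy == "sgd":
        return [rng.randrange(n) for _ in range(T)]
    if strategy == "ig":
        return [t % n for t in range(T)]
    if strategy == "ss":
        perm = list(range(n))
        rng.shuffle(perm)
        return [perm[t % n] for t in range(T)]
    if strategy == "rr":
        order: List[int] = []
        while len(order) < T:
            perm = list(range(n))
            rng.shuffle(perm)
            order.extend(perm)
        return order[:T]
    if strategy == "single":
        return [0] * T
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("sgd", "ig", "ss", "rr", "single"):
        print(name, index_order(name, n=4, T=8))
```

Expressing every strategy as a plain index sequence mirrors the paper's framing of SGD under arbitrary data ordering: the update rule stays the same and only the sequence changes.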