On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Anastasia Koloskova; Nikita Doikov; Sebastian U. Stich; Martin Jaggi

TL;DR

The paper analyzes stochastic gradient methods under arbitrary data ordering for non-convex smooth finite-sum objectives, introducing an adaptive correlation-time framework. By partitioning iterations into correlated blocks of size $\tau = \Theta(1/(L\gamma))$ and bounding a period-variance $\sigma^2_{\tau}$, it derives an $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$ convergence rate. This framework yields improved rates for Incremental Gradient and Single Shuffle over prior $n$-dependent bounds, and it recovers classical SGD rates in the appropriate limit. The results are complemented by experiments showing shuffle-based orders (SS and RR) often outperform plain SGD across quadratic, logistic, and neural network tasks, with stronger gains at smaller learning rates. Overall, the work provides a unified, order-aware theory that explains and quantifies when data ordering accelerates non-convex optimization in practice.

Abstract

In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their theoretical optimization properties, especially for non-convex smooth functions, remain incompletely explored. This paper studies the convergence properties of SGD algorithms with arbitrary data ordering, within a broad framework for non-convex smooth functions. Our findings show enhanced convergence guarantees for incremental gradient and single shuffle SGD. In particular, if $n$ is the training set size, we improve the optimization term of the convergence guarantee to reach accuracy $\varepsilon$ by a factor of $n$, from $O(n / \varepsilon)$ to $O(1 / \varepsilon)$.
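The data orderings discussed above differ only in how the index sequence is generated. The following is a minimal sketch, not the authors' implementation: the function names `data_order` and `run_ordered_sgd` are hypothetical, and the toy least-squares objective $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$ is an assumption chosen for brevity (it is convex, unlike the setting the paper targets).

```python
import numpy as np


def data_order(scheme, n, T, rng):
    """Return T indices in {0, ..., n-1} for the given ordering scheme."""
    if scheme == "sgd":
        # Sample every index independently with replacement.
        return rng.integers(0, n, size=T)
    if scheme == "ig":
        # Incremental Gradient: fixed cyclic order 0, 1, ..., n-1, 0, 1, ...
        return np.arange(T) % n
    if scheme == "ss":
        # Single Shuffle: draw one permutation and reuse it every epoch.
        perm = rng.permutation(n)
        return perm[np.arange(T) % n]
    if scheme == "rr":
        # Random Reshuffling: draw a fresh permutation for every epoch.
        epochs = -(-T // n)  # ceil(T / n)
        return np.concatenate([rng.permutation(n) for _ in range(epochs)])[:T]
    raise ValueError(f"unknown scheme: {scheme}")


def run_ordered_sgd(scheme, A, b, gamma, T, seed=0):
    """Run T gradient steps on the toy objective, visiting rows in the chosen order."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for i in data_order(scheme, n, T, rng):
        grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of 0.5 * (a_i^T x - b_i)^2
        x = x - gamma * grad_i
    return x
```

Only `data_order` changes between methods; the update loop is identical, which is why a single analysis over arbitrary index sequences can cover all of them.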

Paper Structure

This paper contains 39 sections, 4 theorems, 68 equations, 7 figures, and 2 tables.

Introduction
Related work
The Algorithm
Assumptions
Quantifying the data orders
Comparison to the prior quantities
Classic bounded variance assumption.
Variance assumptions that take the data-ordering into account.
Our observation on effective correlation time
Intuitive informal explanation.
Main theorem
Implications for the Special Cases
SGD, Ex. 3.1
Incremental Gradient and Single Shuffle, Ex. 3.2, Ex. 3.3
Single Function, Ex. 3.5
...and 24 more sections

Key Result

Theorem 5.1

Let each of the functions $f_i$ be $L$-smooth (as:smooth). Let the stepsize $\gamma$ in Algorithm eq:algo satisfy $\gamma \leq \frac{1}{8 \sqrt{3} L}$. Let $\tau = \bigl\lfloor\frac{1}{8 \sqrt{3} L \gamma} \bigr\rfloor$, and assume that $\sigma^2_{\tau}$ from Def. def:mai is bounded. Then the method attains the rate $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$ stated in the TL;DR, where $F_0 = f(\mathbf{x}_0) - f^\star$.
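To make the quantities in the theorem concrete, the sketch below computes the correlation time $\tau$ from $L$ and $\gamma$ and evaluates the two terms of the $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$ rate with constants dropped. The helper names `correlation_time` and `rate_terms` are hypothetical, and the value passed as `sigma2_tau` is assumed to be supplied by the user, since this summary does not spell out how $\sigma^2_{\tau}$ is computed.

```python
import math


def correlation_time(L, gamma):
    """tau = floor(1 / (8 * sqrt(3) * L * gamma)), for gamma within the theorem's stepsize bound."""
    max_gamma = 1.0 / (8.0 * math.sqrt(3.0) * L)
    if gamma > max_gamma:
        raise ValueError("stepsize exceeds 1 / (8 * sqrt(3) * L)")
    # Note: for gamma exactly at the bound, floating-point rounding may affect the floor.
    return math.floor(1.0 / (8.0 * math.sqrt(3.0) * L * gamma))


def rate_terms(F0, L, gamma, T, sigma2_tau):
    """Return (optimization term, variance term) of the rate, ignoring absolute constants."""
    optimization_term = F0 / (gamma * T)
    variance_term = L ** 2 * gamma ** 2 * sigma2_tau
    return optimization_term, variance_term
```

For instance, `correlation_time(1.0, 0.01)` gives a block length of 7 steps. Shrinking $\gamma$ lengthens the block and shrinks the variance term, while the optimization term grows, which is the trade-off the stepsize has to balance.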

Figures (7)

Figure 1: Minimizing stochastic quadratic function for different strategies of sampling the gradients. Random Reshuffling (RR) and Single Shuffling (SS) work better than SGD.
Figure 2: Convergence curves for logistic regression on the Australian dataset chang2011libsvm. Random Reshuffling (RR) and Single Shuffling (SS) are faster than SGD across varying learning rates.
Figure 3: Training the neural network model on a subset of MNIST dataset of size $1000$. Random Reshuffling (RR) and Single Shuffling (SS) are better than SGD.
Figure 4: Training the neural network model on CIFAR dataset. Single Shuffle (SS) shows the best performance.
Figure 5: Variance estimation for the Logistic Regression model on w1a dataset. Our variance parameter $\sigma_{\tau}^2$ is significantly better than its corresponding upper bound $n \sigma_{\text{SGD}}^2$ used in classical SGD as well as in some prior works analysing Shuffle SGD strategies Mishchenko2020RandomRS. When $\tau$ is smaller than $\frac{n}{2}$, our variance parameter $\sigma_{\tau}^2$ is also better than $\sigma^2_{\operatorname{EPOCH}}$ used to analyse Shuffle SGD in Mohtashami2022data_order, lu2022a:shuffle.
...and 2 more figures

Theorems & Definitions (14)

Example 3.1: SGD
Example 3.2: Incremental Gradient (IG)
Example 3.3: Single Shuffle (SS)
Example 3.4: Random Reshuffling (RR)
Example 3.5: Single Function
Definition 4.2: Sequence correlation
Example 4.3: Example when \ref{eq:sigmasgd} fails, Mohtashami2022data_order
Example 4.4
Example 4.5: Example when Assumption \ref{eq:sigmaepoch} fails
Theorem 5.1
...and 4 more
