On Convergence of Incremental Gradient for Non-Convex Smooth Functions
Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi
TL;DR
The paper analyzes stochastic gradient methods under arbitrary data ordering for non-convex smooth finite-sum objectives, introducing an adaptive correlation-time framework. By partitioning iterations into correlated blocks of size $\tau = \Theta(1/(L\gamma))$ and bounding a period variance $\sigma^2_{\tau}$, it derives an $\mathcal{O}\left( \frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2_{\tau} \right)$ convergence rate. This framework yields improved rates for Incremental Gradient (IG) and Single Shuffle (SS) over prior $n$-dependent bounds, and it recovers classical SGD rates in the appropriate limit. The results are complemented by experiments showing that shuffle-based orders (SS and Random Reshuffling, RR) often outperform plain SGD across quadratic, logistic, and neural network tasks, with stronger gains at smaller learning rates. Overall, the work provides a unified, order-aware theory that explains and quantifies when data ordering accelerates non-convex optimization in practice.
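To see how the two terms of the stated bound trade off, the following is a rough back-of-the-envelope balancing. It is not a result quoted from the paper. It assumes, purely for illustration, that $\sigma_{\tau}$ can be replaced by a fixed constant $\sigma$ that does not depend on $\gamma$, and it ignores any upper limit on the step size. Setting the two terms equal gives

$$
\frac{F_0}{\gamma T} = L^2 \gamma^2 \sigma^2
\quad\Longrightarrow\quad
\gamma = \left( \frac{F_0}{L^2 \sigma^2 T} \right)^{1/3},
\qquad
\frac{F_0}{\gamma T} + L^2 \gamma^2 \sigma^2 = 2 \left( \frac{L \sigma F_0}{T} \right)^{2/3}.
$$

Under these simplifying assumptions the bound decays as $T^{-2/3}$. In the actual framework $\tau$ depends on $\gamma$, so $\sigma_{\tau}$ does too, and the precise rate has to be taken from the paper's theorems.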
Abstract
In machine learning and neural network optimization, algorithms such as incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their theoretical optimization properties, especially for non-convex smooth functions, remain incompletely explored. This paper studies the convergence properties of SGD algorithms with arbitrary data ordering, within a broad framework for non-convex smooth functions. Our findings show improved convergence guarantees for incremental gradient and single shuffle SGD. In particular, if $n$ is the training set size, we improve the optimization term of the convergence guarantee for reaching accuracy $\varepsilon$ by a factor of $n$, from $O(n / \varepsilon)$ to $O(1 / \varepsilon)$.
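The data orderings discussed above differ only in the sequence of sample indices they visit. The sketch below is a minimal illustration of that difference, not the authors' experimental code. It assumes a toy one-dimensional least-squares objective $f(x) = \frac{1}{n} \sum_i \frac{1}{2}(a_i x - b_i)^2$, and the names `index_order` and `final_grad_sq` are hypothetical helpers introduced here.

```python
import random


def index_order(method, n, epochs, rng):
    """Return the list of sample indices visited over `epochs` passes of n steps."""
    if method == "IG":
        # Incremental gradient: the same fixed order 0, 1, ..., n-1 every epoch.
        return list(range(n)) * epochs
    if method == "SS":
        # Single shuffle: draw one random permutation and reuse it every epoch.
        perm = list(range(n))
        rng.shuffle(perm)
        return perm * epochs
    if method == "RR":
        # Random reshuffling: draw a fresh permutation at the start of each epoch.
        order = []
        for _ in range(epochs):
            perm = list(range(n))
            rng.shuffle(perm)
            order.extend(perm)
        return order
    if method == "SGD":
        # Plain SGD: sample indices uniformly with replacement.
        return [rng.randrange(n) for _ in range(n * epochs)]
    raise ValueError(f"unknown method: {method}")


def final_grad_sq(method, a, b, gamma, epochs, seed=0):
    """Run the chosen ordering on the toy objective; return the squared full gradient."""
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    for i in index_order(method, n, epochs, rng):
        # Gradient of the i-th component 0.5 * (a_i * x - b_i)^2.
        x -= gamma * a[i] * (a[i] * x - b[i])
    full_grad = sum(ai * (ai * x - bi) for ai, bi in zip(a, b)) / n
    return full_grad ** 2


if __name__ == "__main__":
    data_rng = random.Random(1)
    a = [data_rng.gauss(0, 1) for _ in range(50)]
    b = [data_rng.gauss(0, 1) for _ in range(50)]
    for method in ("IG", "SS", "RR", "SGD"):
        print(method, final_grad_sq(method, a, b, gamma=0.01, epochs=20))
```

The update rule is identical in all four cases, so any difference in the printed values comes from the ordering alone. The toy objective is convex and far simpler than the non-convex setting the paper analyzes, so the output should not be read as evidence for or against the paper's rates.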
