Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

Daniil Medyakov; Gleb Molodtsov; Savelii Chezhegov; Alexey Rebrikov; Aleksandr Beznosikov

TL;DR

This paper addresses scalable finite-sum optimization of the form $f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$ and a limitation of standard variance reduction (VR) methods: they require periodic full gradient computations. It proposes No Full Grad SVRG and No Full Grad SARAH, which approximate the full gradient with a moving average maintained across shuffling epochs and borrow ideas from SAG/SAGA while remaining memory-efficient. The paper provides convergence guarantees in both the non-convex and strongly convex regimes and establishes lower bounds for shuffling-based first-order methods. Empirical results on CIFAR-10/100 with ResNet-18 show faster convergence and improved stability compared to baselines.

Abstract

Stochastic optimization algorithms are widely used for machine learning with large-scale data. However, their convergence often suffers from non-vanishing variance. Variance Reduction (VR) methods, such as SVRG and SARAH, address this issue but introduce a bottleneck by requiring periodic full gradient computations. In this paper, we explore popular VR techniques and propose an approach that eliminates the need for expensive full gradient calculations. To avoid these computations and keep our approach memory-efficient, we employ two key techniques: the shuffling heuristic and the idea behind the SAG/SAGA methods. For non-convex objectives, our convergence rates match those of standard shuffling methods, while under strong convexity they improve on them. We empirically validate the efficiency of our approach and demonstrate its scalability on large-scale machine learning tasks, including image classification on the CIFAR-10 and CIFAR-100 datasets.
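
To make the idea in the abstract concrete, the following is a minimal sketch of an SVRG-style loop that never computes a full gradient. It is an illustration under stated assumptions, not the authors' exact algorithm: the toy least-squares problem, the function names (`make_least_squares`, `grad_i`, `no_full_grad_svrg_sketch`), the step size, and the exact form of the running average are all choices made here for the example. Within each shuffled epoch, the stochastic gradients visited along the trajectory are averaged, and that average stands in for the full gradient during the next epoch.

```python
import numpy as np


def make_least_squares(n=200, d=10, seed=0):
    """Hypothetical toy problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    x_true = rng.normal(size=d)
    b = A @ x_true  # noiseless targets, so the minimum loss is zero
    return A, b


def grad_i(A, b, x, i):
    """Gradient of the i-th summand f_i at x."""
    return (A[i] @ x - b[i]) * A[i]


def loss(A, b, x):
    """Finite-sum objective f(x) = (1/n) * sum_i f_i(x)."""
    return 0.5 * np.mean((A @ x - b) ** 2)


def no_full_grad_svrg_sketch(A, b, epochs=50, lr=1e-3, seed=0):
    """SVRG-style updates with the full gradient replaced by a running
    average of stochastic gradients from the previous shuffled epoch.
    Assumed form for illustration only."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    v_prev = np.zeros(d)  # surrogate for the full gradient (zero at start)
    history = [loss(A, b, x)]
    for _ in range(epochs):
        ref = x.copy()            # reference point for this epoch
        v_run = np.zeros(d)       # running average built during this epoch
        perm = rng.permutation(n) # shuffling: one pass without replacement
        for t, i in enumerate(perm):
            g = grad_i(A, b, x, i)
            # Running mean of the gradients seen so far in this epoch.
            v_run = (t * v_run + g) / (t + 1)
            # Variance-reduced step; v_prev replaces the full gradient.
            x = x - lr * (g - grad_i(A, b, ref, i) + v_prev)
        v_prev = v_run            # reuse the average in the next epoch
        history.append(loss(A, b, x))
    return x, history


if __name__ == "__main__":
    A, b = make_least_squares()
    x, history = no_full_grad_svrg_sketch(A, b)
    print(f"initial loss: {history[0]:.4e}, final loss: {history[-1]:.4e}")
```

Each epoch in this sketch costs two stochastic gradient evaluations per sample and stores only two extra vectors (`ref` and `v_prev`), rather than a table of per-sample gradients. Note that the first epoch leaves `x` unchanged because the surrogate starts at zero; it only serves to build the initial average.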
