Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

Daniil Medyakov, Gleb Molodtsov, Savelii Chezhegov, Alexey Rebrikov, Aleksandr Beznosikov

TL;DR

Addresses scalable finite-sum optimization of the form $f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$ and the limitation that standard VR methods require periodic full gradient computations. Proposes No Full Grad SVRG and No Full Grad SARAH, which approximate the full gradient with a moving average across shuffling epochs and leverage SAG/SAGA ideas to stay memory-efficient. Provides convergence guarantees in both non-convex and strongly convex regimes and establishes lower bounds for shuffling-based first-order methods. Empirical results on CIFAR-10/100 with ResNet-18 show faster convergence and improved stability compared to baselines.

Abstract

Stochastic optimization algorithms are widely used for machine learning with large-scale data. However, their convergence often suffers from non-vanishing variance. Variance Reduction (VR) methods, such as SVRG and SARAH, address this issue but introduce a bottleneck by requiring periodic full gradient computations. In this paper, we explore popular VR techniques and propose an approach that eliminates the need for expensive full gradient calculations. To avoid these computations and keep our approach memory-efficient, we employ two key techniques: the shuffling heuristic and the idea behind the SAG/SAGA methods. For non-convex objectives, our convergence rates match those of standard shuffling methods, while under strong convexity they demonstrate an improvement. We empirically validate the efficiency of our approach and demonstrate its scalability on large-scale machine learning tasks, including image classification on the CIFAR-10 and CIFAR-100 datasets.
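
To make the idea in the abstract concrete, here is a minimal sketch of an SVRG-style loop that never computes a full gradient. It is an illustration under stated assumptions, not the authors' algorithm: the function name `no_full_grad_svrg_sketch`, the plain-SGD warm-up epoch, the choice of reference point, and the toy least-squares problem are all hypothetical. The only element taken from the text is the mechanism itself: each epoch visits the components in shuffled order, keeps a running mean of the stochastic gradients it computes anyway, and uses that mean in the next epoch in place of the full gradient.

```python
import numpy as np

def no_full_grad_svrg_sketch(grad_i, x0, n, step, epochs, seed=0):
    """SVRG-style loop whose full-gradient term is a running mean (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    ref = x.copy()   # reference point: the iterate at the start of the epoch
    v = None         # full-gradient proxy; not available until one epoch is done
    for _ in range(epochs):
        running = np.zeros_like(x)
        for t, i in enumerate(rng.permutation(n)):    # one shuffled pass
            g = grad_i(i, x)
            running += (g - running) / (t + 1)        # mean of this epoch's gradients
            if v is None:
                direction = g                         # assumption: warm up with shuffled SGD
            else:
                direction = g - grad_i(i, ref) + v    # SVRG-style corrected direction
            x = x - step * direction
        ref, v = x.copy(), running                    # reuse the mean in the next epoch
    return x

# Toy problem (hypothetical): f_i(x) = 0.5 * (a_i . x - b_i)^2 with consistent targets.
rng = np.random.default_rng(1)
n, d = 20, 3
A = rng.normal(size=(n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)         # unit-norm rows
x_star = rng.normal(size=d)
b = A @ x_star

def grad_i(i, x):
    return A[i] * (A[i] @ x - b[i])

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x_hat = no_full_grad_svrg_sketch(grad_i, np.zeros(d), n, step=0.02, epochs=100)
print("initial loss:", loss(np.zeros(d)), "final loss:", loss(x_hat))
```

Each inner step costs two component-gradient evaluations, as in SVRG, but the pass over all $n$ components that SVRG performs to refresh the full gradient is gone. Storage is a few vectors of the same size as $x$, rather than one stored gradient per component as in SAG/SAGA.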

Paper Structure

This paper contains 34 sections, 24 theorems, 115 equations, 7 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Suppose Assumptions as1, as2 hold. Then for Algorithm alg2 a valid estimate is

Figures (7)

  • Figure 1: No Full Grad SVRG and SVRG.
  • Figure 2: No Full Grad SARAH and SARAH.
  • Figure 3: No Full Grad SVRG and SVRG convergence with theoretical and tuned step sizes on problem \ref{eq:9} on the ijcnn1 (left) and a9a (right) datasets.
  • Figure 4: No Full Grad SARAH and SARAH convergence with theoretical and tuned step sizes on problem \ref{eq:9} on the ijcnn1 (left) and a9a (right) datasets.
  • Figure 5: Convergence of No Full Grad SVRG and SVRG on CIFAR-100.
  • ...and 2 more figures

Theorems & Definitions (37)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Theorem 3
  • Theorem 4
  • Theorem 5: Lower bound
  • Theorem 6: Non-optimality
  • ...and 27 more