A Coefficient Makes SVRG Effective

Yida Yin; Zhiqiu Xu; Zhiyuan Li; Trevor Darrell; Zhuang Liu

TL;DR

This work investigates whether Stochastic Variance Reduced Gradient (SVRG) can be effective for training deep neural networks. It introduces $\alpha$-SVRG, which multiplies the SVRG variance reduction term by a coefficient $\alpha$ that decays linearly over training. The design is motivated by an empirical analysis showing that the optimal per-component coefficient $\alpha^{*}$ is smaller for deeper networks and decreases as training progresses. Empirically, $\alpha$-SVRG consistently reduces training loss compared to both the baseline optimizer and standard SVRG across various model architectures and multiple image classification datasets. The results suggest that controlling the strength of variance reduction matters in deep learning and invite further work on coefficient-based variants of SVRG and related variance reduction methods.
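For orientation, the update summarized above can be written out as follows. The notation is an assumption made for this sketch ($\theta^{t}$ for the current parameters, $\tilde{\theta}$ for the snapshot, $f_i$ for the loss on mini-batch $i$, $f$ for the full-dataset loss); the paper's own equations are not reproduced on this page and may use different symbols.

$$
g_i^{t} = \nabla f_i(\theta^{t}) - \alpha^{t}\left(\nabla f_i(\tilde{\theta}) - \nabla f(\tilde{\theta})\right)
$$

Setting $\alpha^{t} = 1$ gives standard SVRG and $\alpha^{t} = 0$ gives the plain stochastic gradient. The idea of an optimal coefficient follows from the standard control-variate argument for scalar random variables $X$ and $Y$:

$$
\operatorname{Var}\!\left(X - \alpha\,(Y - \mathbb{E}[Y])\right) = \operatorname{Var}(X) - 2\alpha\operatorname{Cov}(X, Y) + \alpha^{2}\operatorname{Var}(Y),
\qquad
\alpha^{*} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)} .
$$

Applied per component, $X$ would be a component of the model stochastic gradient and $Y$ the same component of the snapshot stochastic gradient. This is the textbook result only; the paper's appendix "Derivation of the optimal coefficient" is the authoritative version.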

Abstract

Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes models, consistently reducing training loss compared to the baseline and standard SVRG across various model architectures and multiple image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at github.com/davidyyd/alpha-SVRG.
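The mechanism described in the abstract can be illustrated with a minimal NumPy sketch on a toy least-squares problem. This is not the authors' implementation (see the repository linked above for that). The function names, the per-epoch snapshot, plain SGD as the base optimizer, and the values `alpha0=0.75`, `lr=0.05`, and `batch_size=8` are all assumptions chosen for illustration.

```python
import numpy as np


def linear_decay(alpha0, step, total_steps):
    """Hypothetical schedule: alpha0 at step 0, decaying linearly to 0."""
    return alpha0 * (1.0 - step / total_steps)


def grad(theta, X, y):
    """Mean gradient of 0.5 * (x @ theta - y)**2 over the rows of X."""
    return X.T @ (X @ theta - y) / len(y)


def loss(theta, X, y):
    """Mean least-squares loss over the rows of X."""
    return 0.5 * np.mean((X @ theta - y) ** 2)


def alpha_svrg_gradient(theta, snapshot, full_grad_snapshot, Xb, yb, alpha):
    """Model stochastic gradient minus alpha times the variance reduction term."""
    g_model = grad(theta, Xb, yb)    # stochastic gradient at current parameters
    g_snap = grad(snapshot, Xb, yb)  # same mini-batch, evaluated at the snapshot
    return g_model - alpha * (g_snap - full_grad_snapshot)


def train(X, y, epochs=20, batch_size=8, lr=0.05, alpha0=0.75, seed=0):
    """Toy training loop; a snapshot is taken at the start of every epoch (assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    steps_per_epoch = n // batch_size
    total_steps = epochs * steps_per_epoch
    step = 0
    for _ in range(epochs):
        snapshot = theta.copy()
        full_grad_snapshot = grad(snapshot, X, y)  # full gradient at the snapshot
        perm = rng.permutation(n)
        for b in range(steps_per_epoch):
            idx = perm[b * batch_size:(b + 1) * batch_size]
            alpha = linear_decay(alpha0, step, total_steps)
            g = alpha_svrg_gradient(
                theta, snapshot, full_grad_snapshot, X[idx], y[idx], alpha
            )
            theta -= lr * g  # plain SGD step using the variance reduced gradient
            step += 1
    return theta
```

With `alpha0=1.0` and a constant schedule this reduces to standard SVRG, and with `alpha0=0.0` to plain mini-batch SGD, so the same loop can be used to compare the three on synthetic data. A toy quadratic says nothing about the deep-network behavior the paper studies; it only makes the update rule concrete.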

Paper Structure (24 sections, 13 equations, 28 figures, 11 tables, 2 algorithms)

Introduction
Motivation: SVRG may not always reduce variance
A Closer Look at Control Variates in SVRG
$\alpha$-SVRG
Experiments
Settings
Results
Analysis
Related Work
Conclusion
Derivation of the optimal coefficient
Experimental Settings
Additional Results of $\alpha$-SVRG
Different Initial Coefficients
Standard Deviation Results
...and 9 more sections

Figures (28)

Figure 1: SVRG vs. $\boldsymbol{\alpha}$-SVRG. Both SVRG (left) and $\alpha$-SVRG (right) use the difference between snapshot stochastic gradient (gray) and snapshot full gradient (blue) to form a variance reduction term (orange), which modifies model stochastic gradient (black) into variance reduced gradient (red). But $\alpha$-SVRG employs a coefficient $\alpha$ to modulate the strength of the variance reduction term. With this coefficient, $\alpha$-SVRG reduces the gradient variance and results in faster convergence.
Figure 2: SVRG on Logistic Regression. SVRG effectively reduces the gradient variance for Logistic Regression, leading to a lower training loss than the baseline.
Figure 3: SVRG on MLP-4. In the first few epochs, SVRG reduces the gradient variance for MLP-4, but afterward, SVRG increases it, well above the baseline. As a result, SVRG exhibits a higher training loss than the baseline at the end of training.
Figure 5: SVRG with optimal coefficient on MLP-4. SVRG with the optimal coefficient reduces gradient variance stably and achieves a lower training loss than the baseline SGD.
Figure 6: $\boldsymbol{\alpha}$-SVRG on MLP-4. $\alpha$-SVRG behaves similarly to SVRG with optimal coefficient.
...and 23 more figures
