SVRG and Beyond via Posterior Correction

Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

TL;DR

The paper connects SVRG, which has seen limited success in deep learning, to Bayesian posterior correction (PoCo). Reframing SVRG as a PoCo update over exponential-family posteriors recovers SVRG as a special case for the isotropic-Gaussian family, while richer Gaussian families yield new variants: a Newton-like SVRG (SVRH) and an Adam-like IVON-PoCo for pretraining and finetuning Transformer language models. Experiments on logistic regression, ResNet-50 on ImageNet, and GPT-2 pretraining show faster convergence and improved perplexity, with ablations clarifying the roles of correction strength and inner/outer-loop dynamics. The work frames variance reduction as a form of knowledge transfer, extends SVRG-style ideas to variational training, and suggests applicability to continual learning, federated learning, and model merging.

Abstract

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: first, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
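
To make the "gradient corrections" mentioned in the abstract concrete, here is a minimal sketch of the textbook SVRG loop. It is not the paper's PoCo, SVRH, or IVON-PoCo method. The toy noisy least-squares problem, the function names (`make_data`, `grad_i`, `full_grad`, `svrg`), and all hyperparameters are assumptions chosen for illustration only.

```python
import random


def make_data(n=50, seed=1):
    """Hypothetical toy data: y = w_true . x + small Gaussian noise."""
    rng = random.Random(seed)
    w_true = [2.0, -1.0]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        y = sum(wj * xj for wj, xj in zip(w_true, x)) + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def grad_i(w, x, y):
    """Gradient of the per-example loss 0.5 * (w . x - y)^2 with respect to w."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]


def full_grad(w, data):
    """Average gradient over the whole dataset (the 'outer' gradient)."""
    g = [0.0] * len(w)
    for x, y in data:
        g = [a + b / len(data) for a, b in zip(g, grad_i(w, x, y))]
    return g


def loss(w, data):
    """Average squared-error loss over the dataset."""
    total = 0.0
    for x, y in data:
        total += 0.5 * (sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2
    return total / len(data)


def svrg(data, dim, lr=0.05, outer_steps=40, inner_steps=200, seed=0):
    """Textbook SVRG: each inner step corrects a stochastic gradient using
    the same example's gradient at a snapshot plus the full snapshot gradient."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(outer_steps):
        w_out = list(w)                 # snapshot (outer iterate)
        g_out = full_grad(w_out, data)  # full gradient at the snapshot
        for _ in range(inner_steps):
            x, y = data[rng.randrange(len(data))]
            g_new = grad_i(w, x, y)      # stochastic gradient at current iterate
            g_old = grad_i(w_out, x, y)  # same example, evaluated at the snapshot
            # Variance-reduced direction: g_new - g_old + g_out
            v = [a - b + c for a, b, c in zip(g_new, g_old, g_out)]
            w = [wj - lr * vj for wj, vj in zip(w, v)]
    return w


data = make_data()
w = svrg(data, dim=2)
print("initial loss:", loss([0.0, 0.0], data), "final loss:", loss(w, data))
```

The correction term `g_out - g_old` has zero mean over the sampled example, so the update direction stays unbiased while its variance shrinks as the iterate and the snapshot approach the minimizer.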

Paper Structure

This paper contains 27 sections, 3 theorems, 26 equations, 7 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

For the isotropic-Gaussian family, eq:blr_svrg reduces to the following update of the mean $\mathbf{m}_{\text{in}}$,

Figures (7)

  • Figure 1: (a) We present a generalization of SVRG by using Posterior Correction (PoCo), where gradients used in SGD are replaced by natural gradients of VB objectives via the Bayesian Learning Rule (BLR). (b) Our new IVON-PoCoMo (red) improves performance over IVON and AdamW when pretraining GPT2-125M from scratch on ca. 50B tokens from OpenWebText. Up to 50K steps, IVON-PoCo takes the same steps as IVON, and a large boost is obtained once correction is started. Validation perplexities at the end are 17.4, 18.0, 18.4. (c) We show three different IVON-PoCo runs in which correction is started at different iterations (pink to red). We see consistent improvements irrespective of the starting iteration.
  • Figure 2: SVRG
  • Figure 3: IVON-PoCo significantly boosts the convergence speed of IVON and performs much better than SVRG, here on two convex logistic regression problems of varying dimension and size. The horizontal dashed line indicates the performance at the minimum, the gray bars indicate outer gradient computations used in SVRG and IVON-PoCo.
  • Figure 4: Performance on ImageNet for ResNet-50. When comparing by the number of optimization steps (left), IVON-PoCo gives clear improvements, but not when comparing by the number of data examples seen (right). On the left, we zoom in on the final stage of training when correction is added.
  • Figure 5: Comparison to $\alpha$-SVRG on CIFAR-10 for ResNet-20. IVON-PoCo and $\alpha$-SVRG with SGD give improvements, also when counting the number of data examples seen (right). At the end of training, IVON-PoCo performs best out of all methods.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3