Variational Learning is Effective for Large Deep Networks

Yuesong Shen; Nico Daheim; Bai Cong; Peter Nickl; Gian Maria Marconi; Clement Bazan; Rio Yokota; Iryna Gurevych; Daniel Cremers; Mohammad Emtiyaz Khan; Thomas Möllenhoff

TL;DR

This work challenges the belief that variational learning cannot scale to large deep networks by introducing IVON, a scalable Hessian-aware optimizer that directly optimizes the variational objective with costs comparable to Adam. IVON achieves state-of-the-art accuracy and better predictive uncertainty on large models, including GPT-2 variants and ResNets, and enables practical use cases such as fine-tuning, model merging, and data-sensitivity diagnostics. It demonstrates compelling results across language modeling and image classification, and provides methods for reliable generalization estimation and robust uncertainty under distribution shift. The findings suggest variational principles can yield tangible benefits in calibration, uncertainty, and transferability for large-scale deep learning applications.

Abstract

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.

Paper Structure (43 sections, 9 equations, 8 figures, 14 tables, 1 algorithm)

Introduction
Challenges of Variational Learning for Large Deep Networks
Improved Variational Online Newton
IVON is Effective for Large Deep Networks
Better Scalability and Generalization
Pretraining language models
Image classification
Posterior Averaging for Predictive Uncertainty
In-domain comparisons
Out-of-domain (OOD) comparisons
MC samples for averaging
NeurIPS 2021 Competition
Finetuning and Model Merging
Finetuning pretrained language models
Merging masked-language models
...and 28 more sections

Figures (8)

Figure 1: The first two panels show that IVON closely matches the trajectory of AdamW (Loshchilov & Hutter, 2017) when training GPT-2 on OpenWebText and ResNet-50 on ImageNet. The computational costs of IVON and AdamW are nearly identical; runtime in hours (h) is indicated by the arrows. The third panel shows that IVON's predictions are also better calibrated, as the red curve is closer to the diagonal. Comparisons to SGD on ImageNet are given in the paper's ImageNet table. Final numbers for IVON vs. AdamW are as follows: 12.6 vs. 13.0 perplexity (lower is better) on GPT-2 (773M), 14.1 vs. 14.5 perplexity on GPT-2 (355M), 17.9 vs. 18.1 perplexity on GPT-2 (125M), and 77.5 vs. 75.2 accuracy and 0.022 vs. 0.066 ECE (lower is better) on ResNet-50.
Figure 2: Panel (a) shows that, when training GPT-2, IVON not only improves upon AdamW in terms of validation perplexity but also converges to a training loss that matches or beats AdamW's. Panel (b) shows that IVON trains stably with low-precision (bf16) floating-point numbers. Panel (c) shows that averaging predictions over IVON's posterior further improves the validation perplexity on GPT-2 when a sufficient number of samples is used ($> 8$).
Figure 3: In panels (a) and (b), we see that IVON's histogram of predictive entropy has a high peak, similar to SGD, for in-domain data (red, CIFAR-10), but at the same time is spread out widely, similar to the other Bayesian deep learning methods, for out-of-domain data (gray). The colors are shaded in proportion to the height of the peak; that is, darker red and gray indicate a higher peak. In panel (c), we see that IVON handles in-between uncertainty well, which Foong et al. (2019) showed to be challenging for variational methods.
Figure 4: Using more MC samples during inference (top row) or training (bottom row) can improve both accuracy and NLL, here plotted for ResNet-20 on CIFAR-10.
Figure 5: Panel (a) shows that, for ImageNet, IVON's LOO estimate (solid line with square markers) accurately follows the loss trajectory on an unseen test set (dashed line). Panel (b) shows the same comparison for AdamW, whose estimate follows the test loss less accurately.
...and 3 more figures
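
The captions above refer to three quantities: predictions averaged over MC samples from the posterior (Figures 2 and 4), predictive entropy (Figure 3), and expected calibration error, ECE (Figure 1). The sketch below shows one common way to compute each from class probabilities. It is illustrative only and is not the paper's evaluation code: the function names, the equal-width binning, and the choice of 10 bins are assumptions made for this example.

```python
import math


def posterior_average(sample_probs):
    """Average class probabilities over MC samples.

    sample_probs: list (one entry per posterior sample) of lists of
    class probabilities for a single input. Hypothetical helper.
    """
    n_samples = len(sample_probs)
    n_classes = len(sample_probs[0])
    return [
        sum(probs[c] for probs in sample_probs) / n_samples
        for c in range(n_classes)
    ]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (an assumed, common variant).

    confidences: predicted probability of the chosen class, per example.
    correct: whether each prediction was right, per example.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of data.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Two hypothetical posterior samples that disagree on a 2-class input.
    averaged = posterior_average([[0.9, 0.1], [0.3, 0.7]])
    print("averaged probabilities:", averaged)
    print("predictive entropy:", predictive_entropy(averaged))
    print("ECE:", expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                             [True, True, True, False]))
```

In this sketch, disagreement between posterior samples flattens the averaged distribution and so raises its entropy, which is the kind of spread the out-of-domain histograms in Figure 3 depict. A lower ECE means that confidence and accuracy agree more closely within each bin.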
