Variational Learning is Effective for Large Deep Networks
Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Clement Bazan, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, Thomas Möllenhoff
TL;DR
This work challenges the belief that variational learning cannot scale to large deep networks. It introduces IVON (Improved Variational Online Newton), a scalable Hessian-aware optimizer that directly optimizes the variational objective at a computational cost nearly identical to Adam's. IVON matches or outperforms Adam when training large models such as GPT-2 and ResNets from scratch, and it gives better predictive uncertainty. It also enables practical use cases: improved fine-tuning and model merging in large language models, accurate prediction of generalization error, and faithful estimation of sensitivity to data. The results span language modeling and image classification and include uncertainty estimates that stay robust under distribution shift. Together, they suggest that variational learning can deliver tangible benefits in calibration, uncertainty, and transferability for large-scale deep learning.
Abstract
We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam's, but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.
