Debiasing Mini-Batch Quadratics for Applications in Deep Learning
Lukas Tatzel, Bálint Mucsányi, Osane Hackel, Philipp Hennig
TL;DR
The paper identifies a systematic bias in mini-batch quadratic approximations used for second-order optimization and Laplace-based uncertainty in deep learning. It shows that computing curvatures and slopes on a subset of data inflates top-direction curvature and misaligns eigenstructures relative to full-batch quantities, leading to misguided Newton steps and overconfident uncertainty estimates. The authors derive the biases and propose simple two-batch debiasing strategies, including Debiased Conjugate Gradients and Debiased K-FAC Laplace, demonstrating improved stability and calibration across CNNs and vision transformers while maintaining comparable computational budgets. These results establish a practical design principle for stochastic curvature-based methods and enhance their reliability in large-scale models and datasets.
Abstract
Quadratic approximations form a fundamental building block of machine learning methods. For example, second-order optimizers try to find the Newton step to the minimum of a local quadratic proxy of the objective function, and the second-order approximation of a network's loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable, as is typical in deep learning, the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way, with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
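The bias can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a least-squares model whose per-sample Hessian is x xᵀ, and the function name `directional_curvatures` and all parameter values are hypothetical. It takes the top eigenvector of a mini-batch Hessian and measures the curvature along that direction three ways: on the same mini-batch, on an independent second mini-batch, and on the full data set. The second measurement is only meant to convey the general two-batch idea under these assumptions.

```python
import numpy as np


def directional_curvatures(n_data=2000, dim=50, batch_size=20,
                           n_trials=200, seed=0):
    """Average curvature along the top eigenvector of a mini-batch Hessian.

    Toy assumption: least-squares loss, so the Hessian on inputs X is
    X^T X / len(X). Returns the mean curvature measured on (a) the same
    mini-batch, (b) an independent mini-batch, (c) the full data set.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_data, dim))
    H_full = X.T @ X / n_data

    same, other, full = [], [], []
    for _ in range(n_trials):
        # Draw two disjoint mini-batches A and B.
        idx = rng.choice(n_data, size=2 * batch_size, replace=False)
        A, B = X[idx[:batch_size]], X[idx[batch_size:]]
        H_a = A.T @ A / batch_size
        H_b = B.T @ B / batch_size

        # eigh returns eigenvalues in ascending order; take the last vector.
        _, eigvecs = np.linalg.eigh(H_a)
        v = eigvecs[:, -1]

        same.append(v @ H_a @ v)      # direction and curvature from batch A
        other.append(v @ H_b @ v)     # direction from A, curvature from B
        full.append(v @ H_full @ v)   # reference: full-batch curvature

    return np.mean(same), np.mean(other), np.mean(full)


same, other, full = directional_curvatures()
print(f"same batch: {same:.2f}  second batch: {other:.2f}  full batch: {full:.2f}")
```

In this toy setting, the curvature measured on the same mini-batch that produced the direction comes out much larger than the full-batch curvature along that direction, while the measurement on the independent mini-batch stays close to it.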
