Debiasing Mini-Batch Quadratics for Applications in Deep Learning

Lukas Tatzel, Bálint Mucsányi, Osane Hackel, Philipp Hennig

TL;DR

The paper identifies a systematic bias in mini-batch quadratic approximations used for second-order optimization and Laplace-based uncertainty in deep learning. It shows that computing curvatures and slopes on a subset of data inflates top-direction curvature and misaligns eigenstructures relative to full-batch quantities, leading to misguided Newton steps and overconfident uncertainty estimates. The authors derive the biases and propose simple two-batch debiasing strategies, including Debiased Conjugate Gradients and Debiased K-FAC Laplace, demonstrating improved stability and calibration across CNNs and vision transformers while maintaining comparable computational budgets. These results establish a practical design principle for stochastic curvature-based methods and enhance their reliability in large-scale models and datasets.

Abstract

Quadratic approximations form a fundamental building block of machine learning methods. E.g., second-order optimizers try to find the Newton step into the minimum of a local quadratic proxy to the objective function; and the second-order approximation of a network's loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
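For orientation, the quantities the abstract refers to can be written out as below. This is the standard second-order expansion, not a formula quoted from the paper. The symbols $q$, ${\bm{\theta}}_0$ and ${\bm{H}}_{\mathcal{B}}$ follow the figure captions; the loss $\mathcal{L}_{\mathcal{B}}$, the gradient ${\bm{g}}_{\mathcal{B}}$ and the unit direction ${\bm{u}}$ are notation assumed here, and ${\bm{H}}_{\mathcal{B}}$ may be the Hessian or a proxy such as the GGN.

$$
q({\bm{\theta}}; \mathcal{B}) = \mathcal{L}_{\mathcal{B}}({\bm{\theta}}_0) + ({\bm{\theta}} - {\bm{\theta}}_0)^\top {\bm{g}}_{\mathcal{B}} + \tfrac{1}{2} ({\bm{\theta}} - {\bm{\theta}}_0)^\top {\bm{H}}_{\mathcal{B}} ({\bm{\theta}} - {\bm{\theta}}_0)
$$

If ${\bm{H}}_{\mathcal{B}}$ is positive definite, the minimizer of this quadratic is the Newton step ${\bm{\theta}}_0 - {\bm{H}}_{\mathcal{B}}^{-1} {\bm{g}}_{\mathcal{B}}$. Restricted to a single direction ${\bm{u}}$, the quadratic becomes one-dimensional:

$$
q({\bm{\theta}}_0 + \tau {\bm{u}}; \mathcal{B}) = \mathcal{L}_{\mathcal{B}}({\bm{\theta}}_0) + \tau \, {\bm{u}}^\top {\bm{g}}_{\mathcal{B}} + \tfrac{1}{2} \tau^2 \, {\bm{u}}^\top {\bm{H}}_{\mathcal{B}} {\bm{u}},
\qquad
\tau_\star = - \frac{{\bm{u}}^\top {\bm{g}}_{\mathcal{B}}}{{\bm{u}}^\top {\bm{H}}_{\mathcal{B}} {\bm{u}}} \quad \text{for } {\bm{u}}^\top {\bm{H}}_{\mathcal{B}} {\bm{u}} > 0.
$$

Here ${\bm{u}}^\top {\bm{g}}_{\mathcal{B}}$ is the directional slope and ${\bm{u}}^\top {\bm{H}}_{\mathcal{B}} {\bm{u}}$ the directional curvature. An overestimated curvature therefore shrinks the step $\tau_\star$ and narrows the quadratic along ${\bm{u}}$.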

Paper Structure

This paper contains 42 sections, 53 equations, 21 figures, 1 algorithm.

Figures (21)

  • Figure 1: A systematic bias? We compute five mini-batch quadratics $q(\,\cdot\,; \mathcal{B}_m)$ with batch size $\vert\mathcal{B}_m\vert = 512$ for the loss landscape of the fully trained All-CNN-C model on CIFAR-100 data around ${\bm{\theta}}_0 \gets {\bm{\theta}}_\star$. Each mini-batch quadratic defines a 2D subspace spanned by the top two eigenvectors ${\bm{u}}_1, {\bm{u}}_2$ of ${\bm{H}}_{\mathcal{B}_m}$, in which we evaluate (i) the quadratic $q({\bm{\theta}}_\star + \tau_1 {\bm{u}}_1 + \tau_2 {\bm{u}}_2; \mathcal{B}_m)$ itself and (ii) the full-batch quadratic $q({\bm{\theta}}_\star + \tau_1 {\bm{u}}_1 + \tau_2 {\bm{u}}_2; \mathcal{D})$. In that subspace, the mini-batch quadratic is much "narrower" than the full-batch version, which leads to overly small Newton steps and overconfident uncertainty estimates via the Laplace approximation.
  • Figure 2: Directional slopes and curvatures are biased. We use the CIFAR-100 dataset with the fully trained All-CNN-C model and draw three mini-batches $\mathcal{B}_m$ of size $\vert\mathcal{B}_m\vert = 512$ to compute the top $100$ eigenvectors ${\bm{u}}_1, \ldots, {\bm{u}}_{100}$. For each mini-batch/column, we show the directional slopes (Top) and curvatures (Bottom) evaluated on (i) $q({\bm{\theta}}_\star; \mathcal{B}_{m})$ (i.e. on the same mini-batch of data) as orange dots, (ii) $q({\bm{\theta}}_\star; \mathcal{B}_{m'})$ for $m' \neq m$ (i.e. for all other mini-batches) as blue dots, and (iii) the full-batch quadratic (the average of the orange and all blue dots, see \ref{eq:average_directional_slopes_curvatures}) as ✚. For the top panel, we switch the order and sign of the eigenvectors such that the orange dots are all above zero and in descending order. There is a strong, systematic bias, particularly in the curvature: computing the eigenvectors and directional curvatures on the same data results in over-estimation by roughly one order of magnitude.
  • Figure 3: In practice, eigenspaces are misaligned. We reuse the setting of \ref{fig:bias} and compute the top $100$ eigenvectors ${\bm{U}}_m \in \mathbb{R}^{P \times 100}$ for $\{\mathcal{B}_m\}_{m \in \{0, 1, 2\}}$. The weights ${\bm{\Omega}}_{i, j}$ are shown as a $100 \times 100$ greyscale image (color ranges from black for ${\bm{\Omega}}_{i, j} \leq 10^{-8}$ to white for ${\bm{\Omega}}_{i, j} = 1$) for $m \in \{0, 1\}$, $m' \in \{0, 1, 2\}$. Clearly, the eigenspaces for different mini-batches are not perfectly aligned, as eigenvectors from $\mathcal{B}_m$ overlap with several eigenvectors from $\mathcal{B}_{m'}$.
  • Figure 4: CG update magnitudes are biased. Same setting as \ref{fig:bias_ggn_512} (Bottom). We run CG on $\{\mathcal{B}_m\}_{m \in \{0, 1, 2\}}$ and show the directional update magnitudes $\tau_1, \ldots, \tau_{10}$ for the first $10$ CG steps using (i) the same mini-batch $\mathcal{B}_m$, (ii) all other mini-batches and (iii) the entire training set (as ✚). The magnitudes are given by the negative ratio of the directional slope and curvature (see \ref{eq:cg_update_with_magnitude}) and thus inherit the attached biases. Note that most of the update magnitudes that are based on a single mini-batch of data have the wrong sign, resulting in detrimental updates in the wrong direction.
  • Figure 5: Debiased CG is much more stable than the single-batch approach. We compare CG runs without curvature damping ($\delta = 0$) with $K = 30$ iterations for the fully trained All-CNN-C model on the CIFAR-100 dataset in terms of training/test loss/accuracy at similar computational cost: The single-batch approach uses one mini-batch of size $1024$ while the debiased approach uses two mini-batches of size $512$ each. Both approaches use the GGN curvature proxy and are run $5$ times on different mini-batches. The ◆ markers are placed at peak performance. While the single-batch runs diverge quickly, the debiased CG runs are stable.
  • ...and 16 more figures
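
The effect described in the captions of Figures 2 and 4 can be reproduced on a toy problem. The sketch below is not the paper's experiment or algorithm: it assumes a linear least-squares loss (so gradient and Hessian have closed forms), Gaussian synthetic data, and hypothetical names such as `grad_and_hess` and `directional`. It takes the top eigenvector of one mini-batch Hessian and evaluates the directional slope and curvature on (i) the same mini-batch, (ii) a second, independent mini-batch, and (iii) the full data set. Option (ii) is a simple instance of the two-batch idea mentioned in the TL;DR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): least squares with
# loss(theta) = mean_n 0.5 * (x_n @ theta - y_n)**2.
N, P, B = 4096, 50, 64  # data set size, parameters, mini-batch size
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + rng.standard_normal(N)
theta0 = np.zeros(P)  # expansion point of the quadratic


def grad_and_hess(idx):
    """Gradient and Hessian of the toy loss at theta0 on the rows `idx`."""
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta0 - yb) / len(idx)
    H = Xb.T @ Xb / len(idx)
    return g, H


def directional(g, H, u):
    """Directional slope u^T g and directional curvature u^T H u."""
    return u @ g, u @ H @ u


# Two disjoint mini-batches plus the full data set.
perm = rng.permutation(N)
batch_a, batch_b = perm[:B], perm[B:2 * B]
g_a, H_a = grad_and_hess(batch_a)
g_b, H_b = grad_and_hess(batch_b)
g_full, H_full = grad_and_hess(np.arange(N))

# Direction: top eigenvector of the Hessian on mini-batch A.
eigvals, eigvecs = np.linalg.eigh(H_a)  # eigenvalues in ascending order
u = eigvecs[:, -1]

slope_same, curv_same = directional(g_a, H_a, u)      # same batch as u
slope_other, curv_other = directional(g_b, H_b, u)    # independent batch
slope_full, curv_full = directional(g_full, H_full, u)

# Step magnitudes: negative ratio of directional slope and curvature.
tau_same = -slope_same / curv_same
tau_two_batch = -slope_other / curv_other
tau_full = -slope_full / curv_full

print(f"curvature  same: {curv_same:.3f}  other: {curv_other:.3f}  full: {curv_full:.3f}")
print(f"step size  same: {tau_same:.3f}  two-batch: {tau_two_batch:.3f}  full: {tau_full:.3f}")
```

Because `u` is chosen to maximize the curvature on mini-batch A, the curvature evaluated on that same batch equals its largest eigenvalue and overstates the full-batch curvature along `u`. The curvature measured on the independent batch is noisy but is not selected to be large in this way. The toy data has an identity population Hessian, so the size of the gap here says nothing about the magnitudes reported for All-CNN-C on CIFAR-100.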