Sketched Lanczos uncertainty score: a low-memory summary of the Fisher information

Marco Miani, Lorenzo Beretta, Søren Hauberg

TL;DR

We develop Sketched Lanczos Uncertainty (SLU), an architecture-agnostic uncertainty score that can be applied to pre-trained neural networks with minimal overhead and that consistently outperforms existing methods in the low-memory regime.

Abstract

Current uncertainty quantification is expensive in both memory and compute, which hinders practical uptake. To counter this, we develop Sketched Lanczos Uncertainty (SLU): an architecture-agnostic uncertainty score that can be applied to pre-trained neural networks with minimal overhead. Importantly, the memory use of SLU grows only logarithmically with the number of model parameters. We combine Lanczos' algorithm with dimensionality reduction techniques to compute a sketch of the leading eigenvectors of a matrix. Applying this novel algorithm to the Fisher information matrix yields a cheap and reliable uncertainty score. Empirically, SLU yields well-calibrated uncertainties, reliably detects out-of-distribution examples, and consistently outperforms existing methods in the low-memory regime.
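The combination of Lanczos' algorithm with a sketch can be illustrated roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `make_srft` and `sketched_lanczos` are hypothetical, the sketch is an SRFT-style map built from a unitary DFT, the Lanczos loop uses the plain three-term recurrence without reorthogonalization, and a small dense test matrix stands in for the Fisher information matrix.

```python
import numpy as np

def make_srft(p, s, rng):
    """Return a function v -> S v for an SRFT-style sketch S of shape (s, p)."""
    signs = rng.choice([-1.0, 1.0], size=p)       # random sign flips
    rows = rng.choice(p, size=s, replace=False)   # uniform row subsample
    def sketch(v):
        # Unitary DFT of the sign-flipped vector, subsampled and rescaled.
        return np.sqrt(p / s) * np.fft.fft(signs * v, norm="ortho")[rows]
    return sketch

def sketched_lanczos(matvec, p, k, sketch, rng):
    """Run up to k Lanczos steps, storing only sketches of the Lanczos vectors.

    Only three vectors of length p are alive at any time (v_prev, v, w);
    each Lanczos vector is kept as s numbers instead of p.
    """
    v_prev = np.zeros(p)
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    beta = 0.0
    alphas, betas, sketched = [], [], []
    for _ in range(k):
        sketched.append(sketch(v))
        w = matvec(v) - beta * v_prev             # three-term recurrence
        alpha = v @ w
        w -= alpha * v
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < 1e-12:                          # invariant subspace reached
            break
        betas.append(beta)
        v_prev, v = v, w / beta
    m = len(alphas)
    off = betas[: m - 1]
    T = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)
    return T, np.stack(sketched, axis=1)          # T is (m, m), sketches are (s, m)

# Toy stand-in for the matrix of interest: symmetric, eigenvalues 0.7**i.
rng = np.random.default_rng(0)
p, s, k = 300, 64, 20
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
A = (Q * 0.7 ** np.arange(p)) @ Q.T
T, SV = sketched_lanczos(lambda v: A @ v, p, k, make_srft(p, s, rng), rng)
print("largest Ritz value:", np.linalg.eigvalsh(T).max())
print("sketched Lanczos vectors:", SV.shape)
```

The eigenvalues of the small tridiagonal matrix `T` approximate the leading eigenvalues of the matrix behind `matvec`, while `SV` holds the sketched Lanczos vectors. Any post-processing of `SV` (for example orthogonalizing it, as the lemma titles suggest) is left out here.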

Paper Structure (43 sections, 6 theorems, 17 equations, 9 figures, 5 tables, 2 algorithms)

Key Result

Theorem 2.2

For any $p \times k$ matrix $U$, an SRFT of sketch size $s$ is a $(1\pm \varepsilon)$-subspace embedding for the column space of $U$ with probability at least $1-\delta$, provided that $s = \Omega((k + \log p) \varepsilon^{-2} \log(k / \delta))$.
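To make the statement concrete, the following is a minimal NumPy sketch of an SRFT-style map and an empirical check of the subspace-embedding property. It is an illustration under assumptions rather than the paper's construction: the function names are hypothetical, the Fourier transform is taken to be the unitary DFT, the sketch is scaled by $\sqrt{p/s}$, and the sizes $p$, $k$, $s$ are chosen arbitrarily.

```python
import numpy as np

def srft_sketch(U, s, rng):
    """Apply an SRFT-style sketch S = sqrt(p/s) * R F D to the columns of U."""
    p = U.shape[0]
    signs = rng.choice([-1.0, 1.0], size=p)                        # D: random signs
    mixed = np.fft.fft(signs[:, None] * U, axis=0, norm="ortho")   # F: unitary DFT
    rows = rng.choice(p, size=s, replace=False)                    # R: row subsample
    return np.sqrt(p / s) * mixed[rows]

def embedding_distortion(U, SU):
    """Largest deviation of ||S U x||^2 / ||U x||^2 from 1 over all nonzero x."""
    _, R = np.linalg.qr(U)                  # U = Q R with orthonormal Q
    SQ = SU @ np.linalg.inv(R)              # sketch of Q, by linearity of S
    sing = np.linalg.svd(SQ, compute_uv=False)
    return max(abs(sing.max() ** 2 - 1.0), abs(sing.min() ** 2 - 1.0))

rng = np.random.default_rng(0)
p, k, s = 2048, 5, 512
U = rng.standard_normal((p, k))
SU = srft_sketch(U, s, rng)
print("empirical epsilon:", embedding_distortion(U, SU))
```

The printed value is the smallest $\varepsilon$ for which this particular draw of the sketch satisfies $(1-\varepsilon)\|Ux\|^2 \le \|SUx\|^2 \le (1+\varepsilon)\|Ux\|^2$ for all $x$. With $s = p$ the map is unitary and the distortion vanishes up to rounding.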

Figures (9)

  • Figure 1: OoD detection performance ($\swarrow$) on a ResNet.
  • Figure 2: Exponential decay of the GGN eigenvalues. Average and standard deviation over 5 seeds. Details are in \ref{sec:spectral_property}.
  • Figure 3: Comparison of sketch sizes $s$ for: LeNet $p=40$K on FashionMNIST vs MNIST (left), ResNet $p=300$K on CIFAR-10 vs CIFAR-corrupted with defocus blur (center), and VisualAttentionNet $p=4$M on CelebA vs Food101 (right). The lower the ratio $s/p$, the stronger the memory efficiency.
  • Figure 4: AUROC scores of Sketched Lanczos Uncertainty vs baselines with memory budget $3p$. SLU outperforms the baselines on several pairs of ID (\ref{fig:ceoa}, \ref{fig:ceob}, \ref{fig:ceoc}, \ref{fig:ceod}, \ref{fig:ceoe}) and OoD (x-axis) datasets. Dashed lines are for improved visualization only; see \ref{tab:results} for values and standard deviations. Plots \ref{fig:ceoa}, \ref{fig:ceob}, \ref{fig:ceoc}, \ref{fig:ceod}, \ref{fig:ceoe} are averaged over 10, 10, 5, 3, and 1 independently trained models, respectively.
  • Figure 5: We study the GGN of a LeNet model with $44{,}000$ parameters trained on MNIST. We run $40$ iterations of high-memory Lanczos and low-memory Lanczos. Let $H = [H_1| \dots |H_{40}]$ and $\Lambda_H$ be the eigenvectors and eigenvalues computed by the high-memory algorithm, and $L = [L_1 | \dots |L_{40}]$ and $\Lambda_L$ those computed by the low-memory algorithm. We sort both sets of eigenvectors in decreasing order of corresponding eigenvalues. In position $(i, j)$ we plot $\langle H_i, L_j \rangle$. It is apparent that multiple eigenvectors $L_j$ correspond to the same eigenvector $H_i$.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Definition 2.1: Subspace embedding
  • Theorem 2.2: Essentially, Theorem 7 in woodruff2014sketching
  • Lemma 3.0: Sketching low-rank matrices
  • Lemma 3.0: Orthogonalizing the sketch
  • Lemma A.0: Sketching low-rank matrices
  • proof
  • Lemma A.0: Orthogonalizing the sketch
  • proof
  • Lemma A.0: Orthogonalizing the sketch, for matrix queries
  • proof