Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection

Sam Dauncey, Chris Holmes, Christopher Williams, Fabian Falck

TL;DR

This work addresses the failure of likelihood-based deep generative models to distinguish OOD data by exploiting the gradient of the log-likelihood with respect to model parameters, formalized via the Fisher Information Metric (FIM). It shows that the FIM is diagonally dominant in layer blocks, motivating a practical, layer-wise set of gradient features whose log-norms approximate chi-square behavior; these are combined into a model-agnostic, hyperparameter-free OOD detector that is representation-invariant under invertible transformations. Empirically, the method yields stronger OOD discrimination than the Typicality test on Glow models across multiple image datasets and achieves competitive results for diffusion models, while exhibiting a clear invariance property and scalability across layers. The approach offers a principled, gradient-based alternative for unsupervised OOD detection with broad applicability to differentiable likelihood-based models.

Abstract

Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al., which we reproduce, showed that deep generative models consistently infer higher log-likelihoods for OOD data than for the data they were trained on, marking an open problem. In this work, we analyse the use of the gradient of the log-likelihood of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features into a simple, model-agnostic and hyperparameter-free method for OOD detection, which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative model and image dataset pairings.
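
The pipeline described in the abstract (layer-wise gradient norms as features, then a density estimate over those features) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the "model" is a toy factorised Gaussian whose two parameter groups (`mu` and `log_sigma`) stand in for layers, the joint density of the features is approximated by independent Gaussians fitted to the log squared norms, and all function names are hypothetical.

```python
import math
import random
import statistics

def layer_grad_sq_norms(x, mu, log_sigma):
    """Squared L2 norm of d(log p)/d(theta_j) for each parameter group theta_j."""
    sq_mu, sq_ls = 0.0, 0.0
    for xk, mk, sk in zip(x, mu, log_sigma):
        z = (xk - mk) * math.exp(-sk)       # standardised residual
        sq_mu += (z * math.exp(-sk)) ** 2   # d log p / d mu_k = z / sigma_k
        sq_ls += (z * z - 1.0) ** 2         # d log p / d log_sigma_k = z^2 - 1
    return [sq_mu, sq_ls]

def log_likelihood(x, mu, log_sigma):
    """Log-density of x under the toy factorised Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi) - sk
               - 0.5 * ((xk - mk) * math.exp(-sk)) ** 2
               for xk, mk, sk in zip(x, mu, log_sigma))

def fit_log_norm_stats(feature_rows, eps=1e-12):
    """Mean and standard deviation of each log feature on in-distribution data."""
    columns = zip(*[[math.log(f + eps) for f in row] for row in feature_rows])
    return [(statistics.fmean(c), statistics.stdev(c)) for c in columns]

def ood_score(feature_row, stats, eps=1e-12):
    """Negative log-density of the log features under independent Gaussians.

    Assumption of this sketch: features are treated as independent across
    parameter groups. Higher scores mean "more out-of-distribution".
    """
    score = 0.0
    for f, (m, s) in zip(feature_row, stats):
        z = (math.log(f + eps) - m) / s
        score += 0.5 * z * z + math.log(s) + 0.5 * math.log(2 * math.pi)
    return score

rng = random.Random(0)
dim = 16
mu, log_sigma = [0.0] * dim, [0.0] * dim    # stand-in for a trained model

def sample(scale, n):
    """Draw n points whose coordinates are N(0, scale^2)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n)]

# In-distribution data matches the model; the OOD data has much lower variance,
# so it receives a HIGHER likelihood while its gradient norms are atypical.
train, test_in, test_ood = sample(1.0, 500), sample(1.0, 100), sample(0.1, 100)
stats = fit_log_norm_stats([layer_grad_sq_norms(x, mu, log_sigma) for x in train])

def mean_score(xs):
    return statistics.fmean(
        ood_score(layer_grad_sq_norms(x, mu, log_sigma), stats) for x in xs)

def mean_ll(xs):
    return statistics.fmean(log_likelihood(x, mu, log_sigma) for x in xs)

print("mean log-likelihood  in / OOD:", mean_ll(test_in), mean_ll(test_ood))
print("mean gradient score  in / OOD:", mean_score(test_in), mean_score(test_ood))
```

In this toy setting the low-variance OOD samples are assigned a higher average log-likelihood than the in-distribution samples, mirroring the failure mode reported by Nalisnick et al., while the gradient-based score still flags them. For a real deep generative model, the per-layer gradients would come from automatic differentiation rather than the closed-form expressions used here.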

Paper Structure (44 sections, 4 theorems, 19 equations, 18 figures, 7 tables, 3 algorithms)

Key Result

Proposition 1

Let $p^{\boldsymbol{\theta}}_{\mathcal{X}}(\boldsymbol{x})$ and $p^{\boldsymbol{\theta}}_{\mathcal{T}}(\boldsymbol{t})$ be two probability density functions corresponding to the same model distribution $p^{\boldsymbol{\theta}}$ being represented on two different measure spaces $\mathcal{X}$ and $\ma

Figures (18)

  • Figure 1: Counter-intuitive properties of likelihood-based generative models. Histograms of the negative log-likelihoods inferred from a diffusion model (Ho et al., 2020) [Left] and a Glow model (Kingma & Dhariwal, 2018) [Right], each trained on one of four image datasets (corresponding to the four subplots) and evaluated on the test sets of all four datasets. For diffusion models we use the negative log-likelihood from one step of the diffusion process, $p^{\boldsymbol{\theta}}(\boldsymbol{x}_0 \vert \boldsymbol{x}_1)$. For both models we scale the log-likelihoods by the dimensionality of the data, in this case $3 \times 32 \times 32$. This figure replicates the results of the seminal paper by Nalisnick et al. (2018); our results for diffusion models are novel. We find that the training dataset has a counter-intuitively small impact on the ordering of the datasets as ranked by log-likelihood.
  • Figure 2: The log-likelihood depends heavily on the data representation (Le Lan & Dinh, 2021). Here we plot the first two samples of the CIFAR10 dataset and the difference in Bits Per Dimension (BPD) induced by changing from an RGB to an HSV colour model: $\Delta^{RGB \to HSV}_{BPD} = \frac{\log_2 p_{RGB}(\mathbf{x}) - \log_2 p_{HSV}(\mathbf{x})}{3 \times 32 \times 32}.$ In the appendix (Additional RGB-HSV), we provide experimental details and, in the corresponding appendix figure, replicate this for the first 20 samples, where we observe $\Delta^{RGB \to HSV}_{BPD}$ values ranging from $0.18$ to $1.76$.
  • Figure 3: Layer-wise gradients of the log-likelihood (the score) are highly informative for OOD detection. Their size differs by orders of magnitude between layers, and they are not strictly correlated, rendering layer-wise gradients (in contrast to the full gradient) discriminatory features for OOD detection. In each row, we randomly select two layers with parameters $\boldsymbol{\theta}_i$, $\boldsymbol{\theta}_j$ from a Glow model (Kingma & Dhariwal, 2018) [Top] or a diffusion model (Ho et al., 2020) [Bottom], which have $1353$ and $276$ layers, respectively. The models are trained on CelebA, a dataset that has proved challenging for OOD detection in previous work (Nalisnick et al., 2019). We then evaluate each model on batches $\boldsymbol{x}_{1:B}$ ($B=5$) drawn from the in-distribution and OOD test datasets and compute the squared layer-wise $L^2$-norm of the gradients of the log-likelihood with respect to the parameters of the layer, i.e. $f_{\boldsymbol{\theta}_{j}}(\boldsymbol{x}_{1:B}) = \left\| \nabla_{\boldsymbol{\theta}_j} \left(\sum_{b=1}^B l(\boldsymbol{x}_b)\right) \right\|_2^2$. [Left and Middle] show the two layer-wise gradients separately; [Right] shows their interaction in a scatter plot. In the appendix (Additional experimental details and results), we provide our complete results in three further figures, showing more layers from three likelihood-based generative models, each trained and evaluated on five datasets.
  • Figure 4: The layer-wise FIM has large absolute diagonal values. We randomly select two layers $\theta_i$ and $\theta_j$ from a Glow model trained on CelebA, and randomly select $\min(50, \lvert \theta_j \rvert )$ weights from each layer. We then compute slices of the FIM using the paper's Monte Carlo approximation of the FIM and plot the results, with darker blue at coordinates $(\alpha, \beta)$ corresponding to larger values of the corresponding element of the FIM. To maintain the visual fidelity of the plot when weights between layers vary by orders of magnitude, we normalise row $\alpha$ by a factor of $\sqrt{F_{\alpha \alpha}}$, where $F_{\alpha \alpha}$ denotes the element of the FIM at coordinates $(\alpha, \alpha)$, and likewise for the columns; this is equivalent to re-scaling the model parameters by this factor. The same plots for diffusion models, and plots of the raw values $F_{\alpha\beta}$ (without row- and column-wise normalisation), are presented in the appendix (FIM additional plots) in two further figures, one for Glow and one for diffusion models.
  • Figure 5: The $L^2$-norms of layer-wise gradients have little correlation. We select layers with parameters $\boldsymbol{\theta}_i, \boldsymbol{\theta}_j$ and measure the correlation between the log gradient $L^2$-norms $\log f_{\boldsymbol{\theta}_{i}}(\boldsymbol{x})$ and $\log f_{\boldsymbol{\theta}_{j}}(\boldsymbol{x})$. Binning these correlations by the distance between the layers, $\vert i - j\vert$, and averaging the correlations within each bin gives the above plot. We note that there is a strong correlation in $L^2$-norm between adjacent layers, but that this correlation quickly decays for both in-distribution and out-of-distribution data. We hypothesise that this enables our approximation of the FIM, which assumes independence across layers, to provide good performance.
  • ...and 13 more figures
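
The row- and column-wise normalisation described in the Figure 4 caption can be sketched on a toy model. The following is a hedged, standard-library-only illustration, not the paper's code: it assumes the FIM is estimated by Monte Carlo as the average outer product of the score over samples drawn from the model (the paper's own estimator is not reproduced on this page), uses a one-dimensional Gaussian with parameters $(\mu, \log\sigma)$ in place of a deep generative model, and all function names are hypothetical.

```python
import math
import random

def score(x, mu, log_sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, log_sigma)."""
    sigma = math.exp(log_sigma)
    z = (x - mu) / sigma
    return [z / sigma, z * z - 1.0]

def mc_fim(mu, log_sigma, n_samples, rng):
    """Monte Carlo estimate of E_x[score score^T], with x drawn from the model."""
    fim = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        g = score(rng.gauss(mu, math.exp(log_sigma)), mu, log_sigma)
        for a in range(2):
            for b in range(2):
                fim[a][b] += g[a] * g[b] / n_samples
    return fim

def normalise(fim):
    """Divide row a by sqrt(F_aa) and column b by sqrt(F_bb)."""
    n = len(fim)
    return [[fim[a][b] / math.sqrt(fim[a][a] * fim[b][b]) for b in range(n)]
            for a in range(n)]

rng = random.Random(0)
fim = mc_fim(mu=0.5, log_sigma=math.log(2.0), n_samples=20000, rng=rng)
normalised = normalise(fim)
for row in normalised:
    print([round(v, 3) for v in row])
```

For this toy Gaussian the exact FIM is diagonal, with entries $1/\sigma^2$ and $2$, so the raw diagonal entries differ in scale while the normalised matrix is close to the identity. This is the reason the caption gives for normalising: it keeps the structure visible when raw entries span very different magnitudes.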

Theorems & Definitions (5)

  • Proposition 1
  • Remark 1
  • Proposition 2
  • Proposition 3
  • Proposition 4