OOD Detection with immature Models

Behrooz Montazeran, Ullrich Köthe

TL;DR

The paper addresses the paradox that likelihood-based deep generative models often fail to separate in-distribution (ID) from out-of-distribution (OOD) data, especially when the ID data are more complex than the OOD data. It builds on a recently proposed score based on layer-wise gradients and Fisher information approximations, and demonstrates that partially trained (immature) Glow models can match or exceed the OOD detection performance of fully converged models, often with near-perfect AUROC on multiple dataset pairs. The key finding is that partial training can yield a favorable gap between ID and OOD score distributions, which the paper tentatively attributes to support overlap dynamics, enabling efficient and robust OOD detection. This challenges the assumption that longer training always yields better downstream OOD performance and suggests practical benefits in computational efficiency and model selection for safety-critical applications.

Abstract

Likelihood-based deep generative models (DGMs) have gained significant attention for their ability to approximate the distributions of high-dimensional data. However, these models offer no guarantee that they assign higher likelihood values to in-distribution (ID) inputs, i.e., data the models are trained on, than to out-of-distribution (OOD) inputs. This counter-intuitive behaviour is particularly pronounced when ID inputs are more complex than OOD data points. One potential approach to address this challenge involves leveraging the gradient of a data point with respect to the parameters of the DGMs. A recent OOD detection framework proposed estimating the joint density of layer-wise gradient norms for a given data point as a model-agnostic method, demonstrating superior performance compared to the Typicality Test across likelihood-based DGMs and image dataset pairs. However, most existing methods presuppose access to fully converged models, the training of which is both time-intensive and computationally demanding. In this work, we demonstrate that immature models, stopped at early stages of training, can mostly achieve equivalent or even superior results on this downstream task compared to mature models capable of generating high-quality samples that closely resemble ID data. This novel finding enhances our understanding of how DGMs learn the distribution of ID data and highlights the potential of leveraging partially trained models for downstream tasks. Furthermore, we offer a possible explanation for this unexpected behaviour through the concept of support overlap.
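
To make the gradient-based score concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a toy diagonal Gaussian density in place of a deep generative model and treats its mean and log-scale vectors as two hypothetical "layers". For each layer it computes the log of the squared L2 norm of the gradient of the summed batch log-likelihood, following the scoring function stated in the Figure 2 caption. The function name, the eps safeguard, and the sample values are illustrative assumptions.

```python
import math


def layerwise_gradient_scores(batch, mu, log_sigma, eps=1e-12):
    """Per-layer score: log of the squared L2 norm of the gradient of the
    summed batch log-likelihood with respect to that layer's parameters.

    Toy assumption: the "model" is a diagonal Gaussian whose two parameter
    groups (mu and log_sigma) stand in for the layers of a real DGM.
    """
    grad_mu = [0.0] * len(mu)
    grad_log_sigma = [0.0] * len(mu)
    for x in batch:
        for d, (x_d, m, s) in enumerate(zip(x, mu, log_sigma)):
            sigma = math.exp(s)
            z = (x_d - m) / sigma
            grad_mu[d] += z / sigma           # d log p / d mu_d
            grad_log_sigma[d] += z * z - 1.0  # d log p / d log_sigma_d

    def log_squared_norm(grad):
        # eps is an added numerical safeguard against log(0), not part of the source.
        return math.log(sum(g * g for g in grad) + eps)

    return {
        "mu": log_squared_norm(grad_mu),
        "log_sigma": log_squared_norm(grad_log_sigma),
    }


# Hypothetical parameters and batches of two samples each.
mu = [0.0, 0.0]
log_sigma = [0.0, 0.0]
near_batch = [[0.1, -0.2], [0.3, 0.1]]  # close to the modelled density
far_batch = [[4.0, 5.0], [4.5, 5.5]]    # far from the modelled density

near_result = layerwise_gradient_scores(near_batch, mu, log_sigma)
far_result = layerwise_gradient_scores(far_batch, mu, log_sigma)
print(near_result)
print(far_result)
```

In this toy setting the batch far from the modelled density receives the larger score on both parameter groups. The framework described in the abstract would go on to estimate a joint density over such per-layer scores; that step is omitted here.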

Paper Structure

This paper contains 15 sections, 11 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Visualization of the anomalous behaviour in density-based generative models (Glow). Despite training the model on (left) CIFAR-10 and (right) CelebA as in-distribution (ID) datasets, the model assigns higher likelihoods (higher negative bits per dimension values) to OOD samples from SVHN. This surprising observation is especially pronounced when the complexity of the ID dataset is higher than that of the OOD dataset, highlighting a key limitation of likelihood-based OOD detection in deep generative models. Further comparisons of this phenomenon are illustrated in the paper's section labelled sec:NLL_as_OOD.
  • Figure 2: Layer-wise gradient-based OOD scoring effectively separates ID and OOD samples. The Glow model was trained on two ID datasets, (left) ImageNet32 and (right) CelebA, and in each case tested against the four remaining datasets among SVHN, GTSRB, CIFAR-10, CelebA, and ImageNet32 as OOD data. The gradient norms vary across layers and show a distinct separation between the ID and OOD data distributions. The scoring function $S_{\bm{\theta}^{(l)}}(\bm{x}_{b=\{1,5\}}) = \log \{ \| \nabla_{\bm{\theta}^{(l)}} ( \sum_b \ell(\bm{x}_b) ) \|_2^2 \}$ is computed using $b=5$, i.e., each score is calculated from a batch of five random samples. The near-perfect separation observed between ID and OOD samples highlights the effectiveness of this method. Additional results with varying batch sizes (e.g., $b=1$ and $b=5$) and other ID datasets are detailed in the paper's section labelled sec:Additional_Results.
  • Figure 3: Progressive widening of the gap in histograms of layer-wise gradient-based OOD scores with a batch size of $5$. The figure illustrates how training on a complex ID dataset, such as ImageNet32, affects the gap between the histograms of OOD scores for ID samples and for OOD samples from GTSRB, CIFAR-10, CelebA, and SVHN. The left panel shows the results after 10 epochs, and the right panel shows the results after 250 epochs. Although the gap increases as training progresses, AUROC scores remain unchanged compared to a partially trained model, indicating that early training may suffice for OOD detection tasks. The wider gap reflects improved separation but comes at a higher computational cost. Additional experiments, including results for other batch sizes (e.g., $b=1$) and ID datasets, are discussed in the paper's section labelled sec:Additional_Results.
  • Figure 4: Transition from gap to overlap in histograms of layer-wise gradient-based OOD scores, using a batch size of $5$, as training progresses. This phenomenon occurs when the ID dataset is less complex than the OOD samples, and it results in a decline in OOD detection performance for a fully converged model that generates higher-quality images. The Glow model was trained on the ID dataset SVHN for (left) one epoch and (right) 250 epochs, and tested against four OOD datasets: GTSRB, CIFAR-10, CelebA, and ImageNet32. As training progresses, the distinct separation between the ID and OOD data distributions deteriorates. Additional results with varying batch sizes (e.g., $b=1$ and $b=5$) and other ID datasets are detailed in the paper's section labelled sec:Additional_Results.
  • Figure 5: Overlap Coefficient (OVL). Overlap area (walker2021newmeasure; weitzman1970overlap) between the PDFs of negative BIDs for an ID dataset (CIFAR-10) and an OOD dataset (GTSRB), evaluated using the Glow model trained on CIFAR-10 at different epochs. Panel (a) shows an OVL value of 0.8357 after 50 epochs, and panel (b) an OVL value of 0.8635 for the fully trained model (lower values are better). The increase in OVL reflects a larger overlap of the distributions as training progresses, which correlates with a decline in the AUC values of OOD detection: 0.6310 (50 epochs) and 0.5502 (fully trained); higher values are better.
  • ...and 4 more figures
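
The overlap coefficient in the Figure 5 caption can be illustrated with a small sketch. It assumes a shared-bin histogram estimate of the overlap area, since the captions do not say how the PDFs were estimated. The pairwise AUROC helper assumes that OOD inputs should receive the higher score. All function names and sample values are hypothetical.

```python
def overlap_coefficient(scores_a, scores_b, bins=20):
    """Histogram estimate of OVL: the area under min(pdf_a, pdf_b)."""
    lo = min(min(scores_a), min(scores_b))
    hi = max(max(scores_a), max(scores_b))
    if hi == lo:  # every value is identical, so the distributions coincide
        return 1.0
    width = (hi - lo) / bins

    def bin_masses(values):
        counts = [0] * bins
        for v in values:
            index = min(int((v - lo) / width), bins - 1)  # clamp the right edge
            counts[index] += 1
        return [c / len(values) for c in counts]

    masses_a = bin_masses(scores_a)
    masses_b = bin_masses(scores_b)
    return sum(min(pa, pb) for pa, pb in zip(masses_a, masses_b))


def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score.

    Ties count as half a win. Assumes higher scores indicate OOD inputs.
    """
    wins = 0.0
    for ood in ood_scores:
        for ind in id_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical scores: one well-separated OOD set, one that mixes with the ID set.
id_scores = [0.1, 0.4, 0.5, 0.9, 1.2]
separated_ood = [5.0, 5.3, 5.8, 6.1, 6.4]
mixed_ood = [0.2, 0.45, 0.8, 1.0, 1.5]

print(overlap_coefficient(id_scores, separated_ood), auroc(id_scores, separated_ood))
print(overlap_coefficient(id_scores, mixed_ood), auroc(id_scores, mixed_ood))
```

With fully separated scores the overlap vanishes and AUROC reaches its maximum, while identical distributions give an overlap of 1 and an AUROC of 0.5. This mirrors the direction of the trend reported in the caption, where a larger OVL accompanies a lower AUC. The histogram estimate depends on the bin count, so a kernel density estimate may be preferable for small samples.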