Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection

Genki Osada, Tsubasa Takahashi, Takashi Nishide

TL;DR

This paper investigates why likelihood-based OOD detection using Normalizing Flows and Autoregressive models can fail, particularly when simple or low-complexity images are involved. It introduces Density Concentration Attraction for Simpleness (DCAS), a mechanism whereby less complex inputs map to high-density regions in the NF latent space, increasing both $\log p({\bf z})$ and the volume term $\log |\det J_f({\bf x})|$, thereby inflating $\log p({\bf x})$. Across five NF architectures and PixelCNN++, the authors demonstrate that likelihood-based methods are untrustworthy for OOD detection, and show that incorporating image complexity as an independent variable via a two-variable Gaussian Mixture Model greatly enhances detection performance on MNIST, CIFAR-10, ImageNet, and related datasets. The proposed complexity-aware approach yields robust OOD detection across diverse settings and highlights a practical path to mitigating DCAS-driven failures in DGM-based OOD detection. Overall, the work provides a unified explanation for likelihood failures and offers a scalable, architecture-agnostic remedy with clear implications for safety-critical applications.
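
For reference, the two terms named above come from the standard change-of-variables identity for a Normalizing Flow. The block below is a sketch of that identity, not a formula quoted from the paper; the second line additionally assumes a standard Gaussian base distribution in $d$ dimensions, which may differ from the learned priors used by the architectures studied.

```latex
% Change of variables for an invertible flow z = f(x)
\log p({\bf x}) = \log p({\bf z}) + \log \left| \det J_f({\bf x}) \right|,
\qquad {\bf z} = f({\bf x})

% Assumption: standard Gaussian base distribution in d dimensions
\log p({\bf z}) = -\tfrac{1}{2} \left\lVert {\bf z} \right\rVert^2 - \tfrac{d}{2} \log (2\pi)
```

Read this way, an input that lands closer to the latent origin raises the first term, and DCAS claims that the volume term rises along with it for less complex images, so the two effects compound in $\log p({\bf x})$.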

Abstract

Out-of-distribution (OOD) detection is crucial to safety-critical machine learning applications and has been extensively studied. While recent studies have predominantly focused on classifier-based methods, research on deep generative model (DGM)-based methods has lagged behind. This disparity may be attributed to a perplexing phenomenon: DGMs often assign higher likelihoods to unknown OOD inputs than to their known training data. This paper focuses on explaining the underlying mechanism of this phenomenon. We propose a hypothesis that less complex images concentrate in high-density regions in the latent space, resulting in a higher likelihood assignment in the Normalizing Flow (NF). We experimentally demonstrate its validity for five NF architectures, concluding that their likelihood is untrustworthy. Additionally, we show that this problem can be alleviated by treating image complexity as an independent variable. Finally, we provide evidence of the potential applicability of our hypothesis in another DGM, PixelCNN++.
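
To make "treating image complexity as an independent variable" concrete, here is a minimal sketch that scores an input by a density fitted over the pair (complexity, $\left\lVert{\bf z}\right\rVert$). It is a simplification and not the authors' implementation: it fits a single two-dimensional Gaussian rather than a full mixture, the in-distribution feature values are synthetic numbers chosen only for illustration, and the names `fit_gaussian_2d` and `log_density` are hypothetical.

```python
import math
import random


def fit_gaussian_2d(points):
    """Maximum-likelihood mean and covariance of (complexity, latent norm) pairs."""
    n = len(points)
    mean_c = sum(c for c, _ in points) / n
    mean_r = sum(r for _, r in points) / n
    var_c = sum((c - mean_c) ** 2 for c, _ in points) / n
    var_r = sum((r - mean_r) ** 2 for _, r in points) / n
    cov_cr = sum((c - mean_c) * (r - mean_r) for c, r in points) / n
    return (mean_c, mean_r), (var_c, cov_cr, var_r)


def log_density(point, mean, cov):
    """Log density of a 2D Gaussian with covariance [[var_c, cov_cr], [cov_cr, var_r]]."""
    var_c, cov_cr, var_r = cov
    det = var_c * var_r - cov_cr ** 2
    dc = point[0] - mean[0]
    dr = point[1] - mean[1]
    # Squared Mahalanobis distance using the closed-form 2x2 inverse.
    maha = (var_r * dc * dc - 2 * cov_cr * dc * dr + var_c * dr * dr) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * maha


# Synthetic in-distribution features (assumed values, not measurements).
rng = random.Random(0)
in_dist = [(rng.gauss(5.0, 0.5), rng.gauss(55.0, 2.0)) for _ in range(500)]
mean, cov = fit_gaussian_2d(in_dist)

typical = (5.0, 55.0)     # typical complexity, typical latent norm
simple_ood = (2.0, 55.0)  # typical latent norm, but unusually low complexity

score_typical = log_density(typical, mean, cov)
score_simple_ood = log_density(simple_ood, mean, cov)
print(score_typical, score_simple_ood)
```

The second query point mimics an input whose latent norm alone looks in-distribution. Because complexity is a separate axis, the joint score still drops sharply for it, which is the behavior a norm-only test cannot provide.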

Paper Structure (66 sections, 2 theorems, 19 equations, 12 figures, 7 tables)

Key Result

Lemma 1

Let $f: \mathcal{X} \rightarrow \mathcal{Z}$ be an invertible function being locally $L_{\mathcal{A}}$-Lipschitz for $\mathcal{A} \subset \mathcal{Z}$. For all ${\bf z}' \in \mathcal{A}$, let ${\bf x}' = f^{-1}({\bf z}')$, ${\bf z} = \mathop{\mathrm{\mathbb{E}}}\nolimits {\bf z}' = [\mathop{\mathr

Figures (12)

  • Figure 1: DCAS (see the remark on DCAS) attracts less complex images to the high-density region in latent space. $\mathcal{Z}$ represents a Gaussian latent space trained on CIFAR-10. $O_{\mathcal{Z}}$ represents the origin of $\mathcal{Z}$. The dark blue circle represents the typical set in $\mathcal{Z}$, identified as In-Dist by the typicality test. Complex OOD images like Noise-1 and CIFAR-10 (Noise-4-16) are mapped far from $O_{\mathcal{Z}}$. However, due to DCAS, less complex images like Noise-16 and CIFAR-10 (Pool-4) are mapped closer to $O_{\mathcal{Z}}$ than the circle. We hypothesize that SVHN (an OOD image) should be mapped far beyond the circle. However, due to its lower complexity, it is attracted towards $O_{\mathcal{Z}}$ and coincidentally falls on the circle, leading the typicality test to misidentify SVHN as In-Dist.
  • Figure 2: Complexity-controlled images. Image complexity increases from left to right in both rows. Top: pooled noise images, with pooling size $\kappa$ decreasing through 32, 16, 8, 4, 2, and 1 from left to right. Bottom: manipulated CIFAR-10 with Pool-8, Pool-4, Pool-2, Noise-4-4, Noise-4-8, and Noise-4-16 from left to right.
  • Figure 3: Plots for manipulated CIFAR-10. The left plot shows complexity vs. $\left\lVert{\bf z}\right\rVert$, supporting the remark on DCAS. The right plot shows volume vs. $\left\lVert{\bf z}\right\rVert$, supporting the remark on volume and the correlation observation. We note that $\left\lVert{\bf z}\right\rVert \propto - \sqrt{\log p({\bf z})}$.
  • Figure 4: Complexity vs. $\left\lVert{\bf z}\right\rVert ( \propto - \sqrt{\log p({\bf z})})$ for OOD datasets. Blue is In-Dist (CIFAR-10), green is SVHN, and red is CelebA. Panels from left to right: Glow, ResFlow, and IDF. Glow and ResFlow exhibit more pronounced separation between datasets than IDF. The GMM is trained to capture the in-distribution in these two-dimensional spaces.
  • Figure 5: Existing methods fail in specific combinations. The cases where In-Dist is CIFAR-10 are shown in the left columns, and the cases where In-Dist is SVHN are shown in the right columns. Top row: histograms of CALT. The $x$-axis is $S_{\text{CALT}}({\bf x}) = \log p({\bf x}) +C({\bf x})$. Bottom row: histograms of TTL. The $x$-axis is $\left\lVert{\bf z}\right\rVert ( \propto - \sqrt{\log p({\bf z})})$. CALT fails to detect CIFAR-10 samples as OOD when SVHN is In-Dist (top-right). TTL cannot identify SVHN samples as OOD when CIFAR-10 is In-Dist (bottom-left).
  • ...and 7 more figures
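
The complexity-controlled noise images of Figure 2 can be approximated with a short sketch. Several choices here are assumptions rather than details confirmed by this page: pooling is taken to be average pooling followed by block upsampling back to the original size, the image is single-channel, and complexity is proxied by the zlib-compressed length in bits per dimension, which stands in for whatever codec the paper's Definition 1 uses. The function names `pooled_noise` and `complexity_bits_per_dim` are hypothetical.

```python
import random
import zlib


def pooled_noise(kappa, size=32, seed=0):
    """Uniform noise, average-pooled over kappa x kappa blocks, upsampled back to size x size."""
    rng = random.Random(seed)
    noise = [[rng.randrange(256) for _ in range(size)] for _ in range(size)]
    blocks = size // kappa
    image = [[0] * size for _ in range(size)]
    for by in range(blocks):
        for bx in range(blocks):
            total = sum(
                noise[by * kappa + dy][bx * kappa + dx]
                for dy in range(kappa)
                for dx in range(kappa)
            )
            mean = total // (kappa * kappa)
            # Fill the whole block with its mean (block upsampling).
            for dy in range(kappa):
                for dx in range(kappa):
                    image[by * kappa + dy][bx * kappa + dx] = mean
    return image


def complexity_bits_per_dim(image):
    """Compressed size in bits divided by the number of pixels (zlib as a stand-in codec)."""
    raw = bytes(value for row in image for value in row)
    return 8 * len(zlib.compress(raw, 9)) / len(raw)


for kappa in (32, 16, 8, 4, 2, 1):
    print(kappa, complexity_bits_per_dim(pooled_noise(kappa)))
```

Larger pooling sizes produce larger constant blocks, which compress better, so the proxy falls as $\kappa$ grows. That ordering matches the caption's statement that complexity increases as $\kappa$ decreases.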

Theorems & Definitions (9)

  • Definition 1: Image complexity
  • Definition 2: Local Lipschitz continuity
  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 1
  • proof
  • Lemma 2
  • proof