A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection

Randolph W. Linderman, Yiran Chen, Scott W. Linderman

TL;DR

This work links Bayesian nonparametric Dirichlet process mixture models (DPMMs) to the relative Mahalanobis distance score (RMDS) for out-of-distribution (OOD) detection, showing that RMDS approximates the inlier probability under a Gaussian DPMM with tied covariances. It then extends this link by introducing hierarchical Gaussian DPMMs in which class-specific covariances are learned while sharing statistical strength across classes, using full, diagonal, and coupled diagonal covariance models. The authors derive EM algorithms to fit the hyperparameters and provide closed-form predictive densities (including Student's t forms) for computing OOD scores. Empirical results on synthetic data and the OpenOOD benchmark show that hierarchical DPMMs improve OOD detection, especially when per-class covariance structures differ and there are few data points per class. The results also highlight the limitations of the full covariance model in high dimensions and the practical utility of the diagonal variants.

Abstract

Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
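
For readers unfamiliar with RMDS, the following is a minimal sketch of how the score is commonly computed, not the authors' implementation. It assumes the usual formulation: a Gaussian per class with a tied (shared) covariance, a single background Gaussian fit to all training features, and a score equal to the smallest difference between the class-wise and background squared Mahalanobis distances. The function names `fit_rmds` and `rmds_score`, the sign convention (larger means more outlying), and the toy data are illustrative assumptions.

```python
import numpy as np


def fit_rmds(X, y):
    """Fit class means, a tied covariance, and a background Gaussian."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Tied (shared) within-class covariance: pool the class-centered data.
    centered = X - means[np.searchsorted(classes, y)]
    tied_cov = centered.T @ centered / len(X)
    # Background Gaussian fit to all data, ignoring the labels.
    mu0 = X.mean(axis=0)
    cov0 = np.cov(X, rowvar=False, bias=True)
    return means, np.linalg.inv(tied_cov), mu0, np.linalg.inv(cov0)


def rmds_score(x, means, tied_prec, mu0, prec0):
    """Return min_k [MD_k(x) - MD_0(x)]; larger values are more outlying."""
    d = means - x
    md_k = np.einsum("kd,de,ke->k", d, tied_prec, d)  # squared distance per class
    d0 = mu0 - x
    md_0 = d0 @ prec0 @ d0  # squared distance to the background Gaussian
    return np.min(md_k - md_0)


# Toy data (an assumption for illustration): two well-separated 2D classes.
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal([-3.0, 0.0], 1.0, size=(200, 2)),
    rng.normal([3.0, 0.0], 1.0, size=(200, 2)),
])
y = np.repeat([0, 1], 200)

params = fit_rmds(X, y)
score_in = rmds_score(np.array([3.0, 0.0]), *params)    # at a class center
score_out = rmds_score(np.array([20.0, 0.0]), *params)  # far from both classes
```

In this toy setup the point at a class center should receive a lower score than the distant point, which is the ordering an OOD detector would threshold on.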

Paper Structure

This paper contains 42 sections, 2 theorems, 72 equations, 5 figures, 2 tables.

Key Result

Proposition 1

The inlier probability of a general DPMM with concentration $\alpha$ can be expressed in terms of the following quantities: the logistic (sigmoid) function $\sigma(u) = (1 + e^{-u})^{-1}$, the average cluster size ${\overline{N}=\tfrac{1}{K} \sum_k N_k}$, and $\lambda_k$, the log density ratio of the posterior predictive and prior predictive distributions from the outlier-probability equation (eq:outlier_prob).
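
As a hedged illustration of how these quantities combine, the sketch below assumes the standard Chinese restaurant process predictive rule, in which an existing cluster $k$ receives weight $N_k$ times its posterior predictive density and a new cluster receives weight $\alpha$ times the prior predictive density. Under that assumption the inlier probability is $\sigma(\log \sum_k N_k e^{\lambda_k} - \log \alpha)$. This is algebraically one way to write it and may be parameterized differently from the paper's statement, which is given in terms of $\overline{N}$. The function name `inlier_probability` is hypothetical.

```python
import numpy as np
from scipy.special import expit, logsumexp


def inlier_probability(lam, counts, alpha):
    """Inlier probability under an assumed CRP predictive rule.

    lam    : lambda_k, log ratio of posterior to prior predictive density
    counts : N_k, the number of training points in each cluster
    alpha  : DPMM concentration parameter
    """
    lam = np.asarray(lam, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # log sum_k N_k * exp(lambda_k), computed stably, minus log(alpha).
    logit = logsumexp(lam + np.log(counts)) - np.log(alpha)
    return expit(logit)  # logistic sigmoid


# With lambda_k = 0 the densities cancel and only the counts matter:
# probability = N / (N + alpha) = 60 / 61.
p = inlier_probability([0.0, 0.0, 0.0], [10, 20, 30], alpha=1.0)
```

Working in log space with `logsumexp` avoids overflow when the density ratios are extreme, which is typical in high-dimensional feature spaces.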

Figures (5)

  • Figure 1: Förstner-Moonen distance between all pairs of covariance matrices from the 1000 classes of the ImageNet-1k ViT-B-16 feature space (Data) and between 1000 samples of the Wishart null distribution, $\mathrm{W}(\overline{N}, \hat{\Sigma}/\overline{N})$. See the appendix (app:exploratory_details) for complete details. This discrepancy motivates the hierarchical models below.
  • Figure 2: A: Diagonal of the empirical covariance matrices, $\mathrm{diag}(\hat{\Sigma}_k)$, for five randomly chosen clusters (colored lines) across dimensions. Compared to the diagonal of the average covariance matrix, $\mathrm{diag}(\hat{\Sigma})$, individual clusters tend to have systematically larger or smaller variances than average. B: Correlation across dimensions of the diagonal components of the deviation from the mean, $\hat{\Sigma}_k - \hat{\Sigma}$. The strong positive correlations between all but the first few dimensions indicate that the relationship observed in A is consistent across all clusters.
  • Figure 3: Synthetic experiments. Example 2D datasets sampled from a DPMM with $\nu_0=4$ (A) and $\nu_0=16$ (B). Each dataset has $K=10$ clusters with $N_k=20$ training data points each (colored dots). We evaluate performance on classifying outliers (gray dots) drawn from the prior predictive distribution. C: Performance of DPMMs vs. RMDS when sweeping over $\nu_0$ with $N_k=20$ shows that DPMMs outperform RMDS when $\nu_0$ is small and there is greater variation in the $\Sigma_k$'s. D: Performance of independent RMDS vs. DPMMs as a function of $N_k$ with $\nu_0=4$. Independent RMDS performs well only when there are enough samples per class.
  • Figure 4: Performance on "near OOD", "far OOD", and in-distribution classification as a function of the feature dimension. We projected the 768-dimensional ViT-B-16 features into lower dimensions using PCA, then projected into the eigenspace of the average within-class covariance. We compared the tied model (with full covariance) to the hierarchical model with full, diagonal, and coupled diagonal covariance and measured performance by area under the receiver operating characteristic curve (AUROC).
  • Figure 5: Correlation between the tied DPGMM OOD score and RMDS [ren21rmds].
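
The Förstner-Moonen distance used in Figure 1 compares two symmetric positive-definite matrices through their generalized eigenvalues: $d(A, B) = \sqrt{\sum_i \ln^2 \lambda_i(A, B)}$, where the $\lambda_i$ solve $A v = \lambda B v$. The following is a minimal sketch of that definition, assuming both inputs are symmetric positive definite; the function name `forstner_moonen_distance` and the example matrices are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh


def forstner_moonen_distance(A, B):
    """Distance between two symmetric positive-definite matrices."""
    # Generalized eigenvalues of the pencil (A, B): A v = lambda * B v.
    eigvals = eigh(A, B, eigvals_only=True)
    return np.sqrt(np.sum(np.log(eigvals) ** 2))


# Illustrative matrices (assumptions): the identity and a scaled identity.
A = np.eye(2)
B = np.exp(2.0) * np.eye(2)

d_same = forstner_moonen_distance(A, A)  # identical matrices
d_ab = forstner_moonen_distance(A, B)    # every log-eigenvalue equals -2
d_ba = forstner_moonen_distance(B, A)    # the distance is symmetric
```

Because the distance depends only on the generalized eigenvalues, it is unchanged when the same invertible linear map is applied to both covariances, which makes it a natural choice for comparing class covariances in a learned feature space.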

Theorems & Definitions (4)

  • Proposition 1
  • Proof
  • Proposition 2
  • Proof