phepy: Visual Benchmarks and Improvements for Out-of-Distribution Detectors

Juniper Tyree, Andreas Rupp, Petri S. Clusius, Michael H. Boy

TL;DR

This work addresses the reliability of ML predictions by benchmarking out-of-distribution (OOD) detectors visually. It introduces three toy benchmarks that reveal whether a detector captures linear concepts, non-linear concepts, and thin in-distribution (ID) subspaces, and it compares a range of unsupervised detectors with supervised detectors trained on synthetically generated OOD data. Two practical improvements are proposed to make supervised detectors more precise at the ID-OOD boundary: $t$-poking, which adapts FGSM step lengths, and OOD sample weighting, which down-weights synthetic OOD samples in regions that overlap with ID data. The findings suggest that Gaussian Process uncertainty generalises broadly but at higher cost, while supervised detectors yield crisper but more conservative boundaries. The authors also provide actionable recommendations and an open-source toolkit to facilitate future benchmarking and application.

Abstract

Applying machine learning to increasingly high-dimensional problems with sparse or biased training data increases the risk that a model is used on inputs outside its training domain. For such out-of-distribution (OOD) inputs, the model can no longer make valid predictions, and its error is potentially unbounded. Testing OOD detection methods on real-world datasets is complicated by the ambiguity around which inputs are in-distribution (ID) or OOD. We design a benchmark for OOD detection, which includes three novel and easily visualisable toy examples. These simple examples provide direct and intuitive insight into whether the detector is able to detect (1) linear and (2) non-linear concepts and (3) identify thin ID subspaces (needles) within high-dimensional spaces (haystacks). We use our benchmark to evaluate the performance of various methods from the literature. Since tactile examples of OOD inputs may benefit OOD detection, we also review several simple methods to synthesise OOD inputs for supervised training. We introduce two improvements, $t$-poking and OOD sample weighting, to make supervised detectors more precise at the ID-OOD boundary. This is especially important when conflicts between real ID and synthetic OOD samples blur the decision boundary. Finally, we provide recommendations for constructing and applying out-of-distribution detectors in machine learning.

Paper Structure

This paper contains 6 sections, 1 equation, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Comparison of the reference OOD detector and one-class SVM. Note that the reference is only one possible ideal detection method.
  • Figure 2: Comparison of the Mahalanobis distance, the Local Outlier Factor, Gaussian Processes, and Noise-Contrastive Priors as OOD detectors.
  • Figure 3: Comparison of truncated PCA, neural networks (sharpened by the Mahalanobis distance), and Gaussian Processes as auto-associative OOD detectors.
  • Figure 4: Comparison of supervised OOD detectors using uniform samples and FGSM with a constant step length as OOD synthesis methods.
  • Figure 5: Comparison of supervised OOD detectors using inputs synthesised with FGSM with uniformly sampled and $t$-poking-determined step lengths.
  • ...and 2 more figures
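
To make one of the compared detectors concrete, the following is a minimal sketch of a Mahalanobis-distance OOD score on a two-dimensional toy dataset. It is not the paper's implementation or the phepy API: the function names `fit_mahalanobis` and `mahalanobis_score`, the toy data (ID points scattered around the line $x_2 = x_1$), and the noise scale are all assumptions chosen for illustration.

```python
import numpy as np


def fit_mahalanobis(train):
    """Estimate the mean and inverse covariance of the ID training inputs."""
    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)
    # The pseudo-inverse tolerates a (near-)singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    return mean, cov_inv


def mahalanobis_score(x, mean, cov_inv):
    """Return one distance per row of x; larger means more likely OOD."""
    diff = x - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


rng = np.random.default_rng(0)

# Hypothetical toy data: ID inputs lie close to the line x2 = x1.
t = rng.uniform(-1.0, 1.0, size=500)
train = np.stack([t, t], axis=1) + rng.normal(scale=0.05, size=(500, 2))

mean, cov_inv = fit_mahalanobis(train)

# One query on the ID line and one far off it.
queries = np.array([[0.5, 0.5], [0.5, -0.5]])
scores = mahalanobis_score(queries, mean, cov_inv)
print(scores)
```

Under these assumptions the off-line query receives a much larger score than the on-line one, because the estimated covariance is narrow in the direction perpendicular to the line. Turning the score into an ID/OOD decision would additionally require a threshold, which this sketch leaves open.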