Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox

Xingming Long; Jie Zhang; Shiguang Shan; Xilin Chen

TL;DR

A benchmark named Incremental Shift OOD (IS-OOD) is constructed to address the problem that some marginal OOD samples are semantically close to in-distribution (ID) samples, which makes deciding what counts as an OOD sample a Sorites Paradox.

Abstract

Most existing out-of-distribution (OOD) detection benchmarks treat samples with novel labels as OOD data. However, some marginal OOD samples are semantically close to in-distribution (ID) samples, which makes deciding what counts as an OOD sample a Sorites Paradox. In this paper, we construct a benchmark named Incremental Shift OOD (IS-OOD) to address the issue, in which we divide the test samples into subsets with different degrees of semantic and covariate shift relative to the ID dataset. The division relies on a shift measuring method based on our proposed Language Aligned Image feature Decomposition (LAID). Moreover, we construct a Synthetic Incremental Shift (Syn-IS) dataset that contains high-quality generated images with more diverse covariate contents to complement the IS-OOD benchmark. We evaluate current OOD detection methods on our benchmark and report three findings: (1) The performance of most OOD detection methods improves significantly as the semantic shift increases; (2) Some methods, such as GradNorm, may have different OOD detection mechanisms, as they rely less on semantic shifts to make decisions; (3) Some methods are also likely to flag images with excessive covariate shift as OOD. Our code and data are available at https://github.com/qqwsad5/IS-OOD.
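The evaluation protocol described in the abstract (splitting test samples into subsets by shift degree, then scoring a detector on each subset) can be illustrated with a minimal Python sketch. It is not the released IS-OOD code. The function names, the equal-frequency binning, the `n_levels` and `min_count` parameters, and the use of AUROC as the metric are all assumptions made for illustration, and the shift scores are taken as given rather than computed with LAID.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """AUROC where a higher score means 'more OOD'; ties count as 0.5."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Pairwise comparison (Mann-Whitney U statistic); fine for small arrays.
    greater = (ood_scores[:, None] > id_scores[None, :]).sum()
    ties = (ood_scores[:, None] == id_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (len(id_scores) * len(ood_scores))

def split_by_shift(shift, n_levels):
    """Assign each sample a level in 0..n_levels-1 using quantile bins.

    Equal-frequency binning is an assumption of this sketch.
    """
    shift = np.asarray(shift, dtype=float)
    edges = np.quantile(shift, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.searchsorted(edges, shift, side="right")

def auroc_per_subset(id_scores, test_scores, sem_shift, cov_shift,
                     n_levels=3, min_count=5):
    """Return an (n_levels, n_levels) grid of AUROC values.

    Rows index the semantic shift level, columns the covariate shift level.
    Cells with fewer than min_count samples stay NaN.
    """
    test_scores = np.asarray(test_scores, dtype=float)
    sem = split_by_shift(sem_shift, n_levels)
    cov = split_by_shift(cov_shift, n_levels)
    grid = np.full((n_levels, n_levels), np.nan)
    for s in range(n_levels):
        for c in range(n_levels):
            mask = (sem == s) & (cov == c)
            if mask.sum() >= min_count:
                grid[s, c] = auroc(id_scores, test_scores[mask])
    return grid
```

Reading the grid along a row or a column shows how a detector's performance changes as one kind of shift grows while the other stays at a fixed level.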

Paper Structure (33 sections, 9 equations, 20 figures, 8 tables, 1 algorithm)

Introduction
Benchmark Construction
Feature Decomposition
Shift Measuring and Subsets Division
Generation of Syn-IS
Metrics
Experiments
Main Results on ImageNet-21K
Results on Syn-IS
Conclusion and Discussion
Comparison with Other Benchmarks
Analysis of the Proposed Decomposition Method
Details of the Subsets Division
Prompts Used for Syn-IS Generation
Evaluated OOD Detection Methods
...and 18 more sections

Figures (20)

Figure 1: Examples of images from the IS-OOD benchmark. ImageNet-21K is divided into subsets with different semantic and covariate shift levels relative to ImageNet-1K. As semantic shift increases, images in the subsets change from marginal samples (such as animal subspecies) to more distinct OOD categories (such as "gasket"). As covariate shift increases, the covariate contents transition from object-centered real photos to synthetic images, and from high-definition color images to low-resolution monochrome images.
Figure 2: Examples of noise caused by inaccurate semantic labels. The images in the bottom row are semantically similar to the ID data (images in the top row), yet some benchmarks treat them as OOD samples because of their labels.
Figure 3: Overview of the Language Aligned Image feature Decomposition (LAID) method. We first construct texts using different semantic and covariate prompts and train an orthogonal transformation matrix for the decomposition in the text feature space. We then apply this matrix to the decomposition in the image feature space, leveraging the alignment property of the CLIP model.
Figure 4: Example images from different Syn-IS subsets and their corresponding prompts. Subsets with low covariate shifts typically include more realistic-style images (such as "HDR photo"), whereas subsets with high covariate shifts tend to contain more abstract-style images (such as "papercut").
Figure 5: OOD detection performance on all ImageNet-21K subsets with different semantic and covariate shift levels. "N/A" indicates that the subset contains too few samples for a fair evaluation.
...and 15 more figures
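
The Figure 3 caption describes learning an orthogonal matrix in the text feature space and reusing it on image features. The following is a minimal sketch of that idea only, and it makes strong assumptions. It swaps the trained matrix for a closed-form SVD of the covariate-induced variation, it takes features as plain NumPy arrays instead of CLIP embeddings, and the function names and the `n_cov_dims` parameter are hypothetical. It should not be read as the paper's LAID implementation.

```python
import numpy as np

def fit_orthogonal_basis(text_feats):
    """Fit an orthogonal basis from text features.

    text_feats has shape (n_semantic, n_covariate, d); entry [i, j] is the
    feature of a text built from semantic prompt i and covariate prompt j.
    Returns an orthogonal (d, d) matrix whose leading columns capture the
    directions that vary most when only the covariate prompt changes.
    """
    text_feats = np.asarray(text_feats, dtype=float)
    d = text_feats.shape[-1]
    # Variation across covariate prompts with the semantic prompt held fixed.
    cov_var = text_feats - text_feats.mean(axis=1, keepdims=True)
    # Right singular vectors give an orthonormal basis of R^d, ordered by
    # how much covariate variation each direction explains.
    _, _, vt = np.linalg.svd(cov_var.reshape(-1, d), full_matrices=True)
    return vt.T

def decompose(feats, basis, n_cov_dims):
    """Split features into (semantic, covariate) coordinates.

    The first n_cov_dims coordinates in the rotated space are treated as
    the covariate part, the remaining ones as the semantic part.
    """
    coords = np.asarray(feats, dtype=float) @ basis
    return coords[..., n_cov_dims:], coords[..., :n_cov_dims]
```

Under the alignment assumption stated in the caption, the same `basis` fitted on text features would be passed to `decompose` together with image features. How well such a transfer works in practice is what the paper's analysis sections address, not something this sketch demonstrates.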
