SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models

Shu Zou; Xinyu Tian; Qinyu Zhao; Zhaoyuan Yang; Jing Zhang

TL;DR

This work investigates how well VLMs comprehend images with respect to semantically related ID labels and proposes a post-hoc strategy called SimLabel, which improves the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels.

Abstract

Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often leverage language information from vision-language models (VLMs) to enhance OOD detection, improving confidence estimation through rich class-wise text information. However, when building OOD detection scores on in-distribution (ID) text-image affinity, existing works focus either on each ID class individually or on the whole ID label set, overlooking the inherent connections between ID classes. We find that the semantic information shared across different ID classes is beneficial for effective OOD detection. We therefore investigate how well VLMs comprehend images with respect to semantically related ID labels and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed method also extends to various VLM backbones, demonstrating good generalization ability. Our demonstration and implementation code is available at: https://github.com/ShuZou-1/SimLabel.
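The scoring idea in the abstract can be illustrated with a minimal sketch. The first score follows the maximum-softmax form of the MCM baseline that the paper compares against. The second is only an assumed illustration of a consistency-aware score (averaging image-text similarity over each class's similar labels); it is not the SimLabel formula, which this summary does not state. All function names and the toy 2-D embeddings are hypothetical, and real embeddings would come from a frozen VLM such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mcm_score(image_emb, class_embs, temperature=1.0):
    """MCM-style score: maximum softmax-scaled image-to-class similarity."""
    sims = [cosine(image_emb, t) for t in class_embs]
    return max(softmax(sims, temperature))

def consistency_score(image_emb, similar_embs, temperature=1.0):
    """ASSUMPTION, not the paper's formula: for each ID class, average the
    image-text similarity over that class's similar labels, then take the
    maximum softmax value over classes."""
    sims = []
    for group in similar_embs:  # one group of similar-label embeddings per ID class
        group_sims = [cosine(image_emb, t) for t in group]
        sims.append(sum(group_sims) / len(group_sims))
    return max(softmax(sims, temperature))

# Toy data: two ID classes, each with two hypothetical similar labels.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
similar_embs = [
    [[1.0, 0.1], [0.9, 0.2]],  # similar labels for class 0
    [[0.1, 1.0], [0.2, 0.9]],  # similar labels for class 1
]
id_image = [1.0, 0.0]   # aligned with class 0
ood_image = [1.0, 1.0]  # equally close to both classes

print(mcm_score(id_image, class_embs), mcm_score(ood_image, class_embs))
print(consistency_score(id_image, similar_embs),
      consistency_score(ood_image, similar_embs))
```

In both scores a higher value suggests an ID sample; a sample would be flagged as OOD when its score falls below a threshold chosen on held-out data.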

Paper Structure (21 sections, 7 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related works
Preliminaries
Methodology
CLIP-based OOD detection
Motivation
Similar class generation
Similar classes based on text hierarchy
Similar classes from large language models
Similar classes with image-text alignment
OOD detection with similar classes
Experiment
Experiment setup
Experimental results and analysis
Discussion
...and 6 more sections

Figures (10)

Figure 1: (a) Illustration of VLM-guided OOD detection for an ID sample (top image, from ImageNet [imagenet2009]) and an OOD sample (bottom image, from iNaturalist [vanhorn2018inaturalist]). (b) Comparison between the proposed SimLabel and the baseline MCM [ming2022delving] for ID (top) and OOD (bottom) samples, showing how our method detects OOD samples by aggregating scores across similar ID labels (yellow bars denote the similarity between the image and similar-class labels; blue bars denote the similarity between the image and class labels).
Figure 2: Sorted average similarity between images of a specific class and the whole label set (1,000 labels in our case). Images $x \in \mathcal{X}$ show high similarity to several ID classes rather than to one single label. Notably, we follow the score design of MCM [ming2022delving], in which the similarities are transformed with the softmax function.
Figure 3: Samples of similar classes for the class "Great White Shark", obtained with the methods in Sec. \ref{sec:Co-Ocu_sim_class_generation} and Sec. \ref{sec:similar_class_GPT}. Left: the similar classes generated from the ID labels. Right: the similar classes generated by an LLM.
Figure 4: Overview of the SimLabel zero-shot OOD detection framework. The image encoder first encodes ID and OOD images into image embeddings $\mathbf{h}$ and $\mathbf{h}'$, respectively. For every class label (blue blocks) in the ID label set $\mathcal{L}$, similar classes (yellow blocks) are generated through the similar class generation process defined in Sec. \ref{sec:choise_Sim_Class}. The text encoder embeds the ID labels and their similar class labels, wrapped in prompts, into text embeddings, and the image-text similarities are measured using the function defined in Eq. \ref{equa:CLIP_classify}. Both the image and text encoders are frozen. The charts below indicate that ID $\textit{cat}$ images produce higher similarity to similar classes such as $\textit{lion, owl, manul}$ than OOD images that are predicted as $\textit{cat}$. Our proposed SimLabel (detailed in Sec. \ref{Sec:SimLabel}) conducts OOD detection by utilizing both the image-to-class-label similarity and the image-to-similar-class-label similarity.
Figure 5: Zero-shot OOD detection performance comparison on hard OOD detection tasks. Following MCM [ming2022delving], we use subsets of ImageNet-1k [imagenet2009] to test the performance of SimLabel on hard OOD detection tasks.
...and 5 more figures
