Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP

Sepideh Esmaeilpour, Bing Liu, Eric Robertson, Lei Shu

TL;DR

This work tackles zero-shot out-of-distribution detection by leveraging a pre-trained CLIP model augmented with an image description generator. At inference, the method generates candidate unseen labels for each test image and computes an OOD confidence score from the similarity of the image to both seen and generated labels: $S(x)=1-\sum_{y\in\mathcal{Y}_s} P(y|x)$. Experiments on five benchmark dataset splits show that the proposed ZOC method substantially outperforms strong supervised baselines and CLIP-based MSP, demonstrating the value of dynamic unseen-label reasoning for robust OOD detection. The approach enables effective OOD detection without any unseen-class training data, advancing safety in real-world deployments using large multimodal pre-trained models.

Abstract

In an out-of-distribution (OOD) detection problem, samples of known classes (also called in-distribution classes) are used to train a special classifier. In testing, the classifier can (1) classify the test samples of known classes to their respective classes and also (2) detect samples that do not belong to any of the known classes (i.e., they belong to some unknown or OOD classes). This paper studies the problem of zero-shot out-of-distribution (OOD) detection, which still performs the same two tasks in testing but involves no training beyond the use of the given known class names. This paper proposes a novel yet simple method (called ZOC) to solve the problem. ZOC builds on recent advances in zero-shot classification through multi-modal representation learning. It first extends the pre-trained language-vision model CLIP by training a text-based image description generator on top of CLIP. In testing, it uses the extended model to generate candidate unknown class names for each test sample and computes a confidence score based on both the known class names and the candidate unknown class names for zero-shot OOD detection. Experimental results on 5 benchmark datasets for OOD detection demonstrate that ZOC outperforms the baselines by a large margin.

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: The diagram illustrates the inference steps of ZOC for a sample from an unseen class 'boat'. The available seen class labels (shown in green) are $\mathcal{Y}_s$={'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog'}. In the first step, the image is encoded through CLIP$_{image}$ and an image description is then generated at the output of Decoder$_{text}$. The description is in fact a set of candidate unseen labels $\mathcal{Y}_u$ (shown in orange). In the second step, the labels in $\mathcal{Y}_s \cup \mathcal{Y}_u$ are encoded through CLIP$_{text}$ on the right. The purple ellipsoid shows CLIP's feature space, where the relevant labels are aligned with the image. CLIP quantifies the alignment by calculating the cosine similarity of each encoded label to the encoded image. Then $S(x)$ is obtained according to the confidence score equation. The score is high for this image as it is more similar to the set $\mathcal{Y}_u$ than to $\mathcal{Y}_s$. The inference relies on CLIP's pre-trained encoders as well as on the $\mathcal{Y}_u$ generated by Decoder$_{text}$. (Best viewed in color)
  • Figure 2: (a) A summary of the generated labels for a seen class 'espresso' and an unseen class 'guacamole' from the TinyImagenet dataset. The generated labels are ranked by their contribution to $S(x)$. The labels with $P(y|x)>0.1$ are in boldface. For class 'espresso', we expect the model to output a relatively low $S(x)$ as the actual label is present among the seen labels (first two images). The third image is an error case: the set of generated labels and the label 'coffee' produce a high $S(x)$. For the unseen class 'guacamole', $S(x)$ is high for the first two images, as expected, since ZOC correctly associates the generated labels with the images. The third image is again an error case, in which a seen label 'frying pan' contributes to $S(x)$ more than the generated unseen labels. (Best viewed in color) (b) The 20 seen labels from TinyImagenet are listed at the top. The 4 classes 'skirt', 'teddy', 'tractor' and 'koala' are a subset of the unseen classes. Each subplot shows the histogram of the confidence score $S(x)$. For instance, in the histogram for unseen class 'skirt', we can clearly see that more than 40 samples have confidence scores between 0.8 and 1, which is desirable for good detection performance. $S(x)$ tends to have a higher variance in the other 3 unseen class plots. It is interesting to note that for the samples from class 'tractor', the confidence score is relatively low because the class is confused with the semantically similar seen labels 'school bus' and 'go-kart'. Similarly, a low $S(x)$ for 'koala' mostly occurs when the model associates the image with the seen labels 'orangutan' and 'German shepherd'. The confidence scores for the two seen classes 'school bus' and 'vestment' are distributed in lower ranges, as expected. (Best viewed in color)
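
The second inference step described in Figure 1 can be illustrated with a minimal NumPy sketch. It assumes the image and label embeddings have already been computed (in ZOC these would come from CLIP's image and text encoders, which are not loaded here), and that $P(y|x)$ is a softmax over the cosine similarities to all labels in $\mathcal{Y}_s \cup \mathcal{Y}_u$. The function name `ood_score` and the `temperature` parameter are hypothetical and introduced only for this illustration; they are not taken from the paper.

```python
import numpy as np

def ood_score(image_emb, seen_embs, generated_embs, temperature=1.0):
    """Sketch of S(x) = 1 - sum_{y in Y_s} P(y|x).

    image_emb:      (d,)   embedding of the test image
    seen_embs:      (k, d) embeddings of the seen labels Y_s
    generated_embs: (m, d) embeddings of the generated candidate labels Y_u
    temperature:    assumed softmax scaling (hypothetical, not from the source)
    """
    image_emb = np.asarray(image_emb, dtype=float)
    seen_embs = np.asarray(seen_embs, dtype=float)
    generated_embs = np.asarray(generated_embs, dtype=float)

    # Stack seen labels first so their probabilities are the leading entries.
    labels = np.vstack([seen_embs, generated_embs])

    # Cosine similarity = dot product of L2-normalised vectors.
    image_unit = image_emb / np.linalg.norm(image_emb)
    label_units = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    sims = label_units @ image_unit

    # Numerically stable softmax over Y_s ∪ Y_u.
    logits = sims / temperature
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()

    # Probability mass that falls outside the seen labels.
    return 1.0 - probs[: len(seen_embs)].sum()


# Toy 2-D embeddings (made up for illustration, not real CLIP features).
seen = np.array([[1.0, 0.0], [0.0, 1.0]])
generated = np.array([[-1.0, 0.0]])

score_in = ood_score(np.array([1.0, 0.0]), seen, generated)    # aligned with a seen label
score_out = ood_score(np.array([-1.0, 0.0]), seen, generated)  # aligned with a generated label
```

In this toy setup `score_out` is larger than `score_in`, mirroring the caption's observation that $S(x)$ is high when the image is closer to $\mathcal{Y}_u$ than to $\mathcal{Y}_s$. A threshold on the score would then separate seen from unseen samples; how that threshold is chosen is not covered by this sketch.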