Negative Label Guided OOD Detection with Pretrained Vision-Language Models

Xue Jiang, Feng Liu, Zhen Fang, Hong Chen, Tongliang Liu, Feng Zheng, Bo Han

TL;DR

This work tackles zero-shot OOD detection in vision-language models by enlarging the label space with a large set of negative labels sourced from broad corpora. It introduces NegLabel, a post hoc detector that uses NegMining to select discriminative negative labels and a sum-softmax OOD score that fuses affinities to ID labels and negative labels, with a grouping strategy to reduce variance. The authors provide a theoretical analysis linking negative labels to improved separability between ID and OOD via a multi-label framework and normal approximation, and demonstrate state-of-the-art performance across CLIP-like and other VLM architectures while showing robustness to domain shifts. The method is simple to deploy, does not require fine-tuning, and has practical implications for safer deployment of VLMs in open-world settings.
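The scoring idea summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity between embeddings, a softmax over the joint set of ID and negative labels, and a simple even split of the negative labels into groups whose scores are averaged. The function name `neglabel_score`, the default `temperature`, and the toy embeddings are hypothetical choices made for this example.

```python
import numpy as np

def neglabel_score(image_emb, id_embs, neg_embs, temperature=0.01, num_groups=1):
    """Sketch of a sum-softmax score: softmax mass assigned to the ID labels.

    temperature and num_groups defaults are assumptions for illustration.
    """
    # L2-normalize so that dot products are cosine similarities.
    h = image_emb / np.linalg.norm(image_emb)
    e_id = id_embs / np.linalg.norm(id_embs, axis=1, keepdims=True)
    e_neg = neg_embs / np.linalg.norm(neg_embs, axis=1, keepdims=True)

    sim_id = e_id @ h    # affinities to ID labels
    sim_neg = e_neg @ h  # affinities to negative labels

    group_scores = []
    # Grouping: score against each subset of negative labels, then average.
    for group in np.array_split(sim_neg, num_groups):
        logits = np.concatenate([sim_id, group]) / temperature
        logits = logits - logits.max()  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        group_scores.append(probs[: len(sim_id)].sum())
    return float(np.mean(group_scores))

# Toy 3-d embeddings (hypothetical): two ID labels and two negative labels.
id_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
neg_embs = np.array([[0.0, 0.0, 1.0], [0.0, 0.1, 1.0]])

score_id = neglabel_score(np.array([1.0, 0.1, 0.0]), id_embs, neg_embs, num_groups=2)
score_ood = neglabel_score(np.array([0.0, 0.1, 1.0]), id_embs, neg_embs, num_groups=2)
print(score_id, score_ood)
```

A higher score means the image is closer to the ID labels than to the negative labels, so thresholding the score separates ID from OOD inputs. In this toy setup the first image aligns with an ID label and the second with the negative labels.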

Abstract

Out-of-distribution (OOD) detection aims to identify samples from unknown classes and is crucial for keeping trustworthy models from making errors on unexpected inputs. Extensive research has been dedicated to OOD detection in the vision modality. Vision-language models (VLMs) can leverage both textual and visual information for various multi-modal applications, yet few OOD detection methods take information from the text modality into account. In this paper, we propose a novel post hoc OOD detection method, called NegLabel, which draws a vast number of negative labels from extensive corpus databases. We design a novel scheme for an OOD score that incorporates negative labels. Theoretical analysis helps explain the mechanism of negative labels. Extensive experiments demonstrate that NegLabel achieves state-of-the-art performance on various OOD detection benchmarks and generalizes well across multiple VLM architectures. Furthermore, NegLabel exhibits remarkable robustness against diverse domain shifts. The code is available at https://github.com/tmlr-group/NegLabel.

Paper Structure

This paper contains 38 sections, 22 equations, 6 figures, 21 tables, 4 algorithms.

Figures (6)

  • Figure 1: Overview of NegLabel. The image encoder maps input images to image embeddings $\bm{\mathit{h}}$. The text encoder maps ID and negative labels to text embeddings $\bm{\mathit{e}}$. All encoders are frozen at inference time. The negative labels are selected by NegMining (see \ref{sec:neg_select}) from a large-scale corpus. The image-text similarities are quantified through $\bm{\mathit{h}} \cdot \bm{\mathit{e}}$, represented by purple blocks (darker shades indicate higher similarity). The right part illustrates that ID images tend to produce lower similarities with negative labels than OOD images do. Our NegLabel score (see \ref{sec:score}) fuses the similarities of image-ID labels (green) and image-negative labels (blue).
  • Figure 2: Illustration of NegMining. The algorithm selects negative labels with larger distances (lower similarities) from the ID labels. Darker blue squares indicate higher selection priority. Dashed squares represent negative labels that cannot be selected.
  • Figure 2: Zero-shot OOD detection performance comparison on hard OOD detection tasks.
  • Figure 3: Zero-shot OOD detection performance robustness to domain shift.
  • Figure 4: Case visualization of ID images. The left part of each subfigure contains the original image, its filename and its dataset name. The right part shows the softmax-normalized affinities among ID (blue) and negative (green) labels.
  • ...and 1 more figures
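
The selection rule described in the Figure 2 caption (prefer candidate labels that are far from the ID labels) can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's NegMining algorithm: each candidate's closeness to the ID label set is summarized by a high percentile of its cosine similarities, and a fixed fraction of the least similar candidates is kept. The names `mine_negative_labels`, `keep_ratio`, and `percentile`, as well as their default values, are hypothetical.

```python
import numpy as np

def mine_negative_labels(candidate_embs, id_embs, keep_ratio=0.15, percentile=95.0):
    """Return indices of candidates that are least similar to the ID labels.

    keep_ratio and percentile are illustrative assumptions, not paper values.
    """
    # L2-normalize so that dot products are cosine similarities.
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    e = id_embs / np.linalg.norm(id_embs, axis=1, keepdims=True)

    sims = c @ e.T  # shape: (num_candidates, num_id_labels)
    # Summarize how close each candidate is to the ID label set.
    closeness = np.percentile(sims, percentile, axis=1)

    num_keep = max(1, int(round(keep_ratio * len(c))))
    # Lowest closeness first: these candidates are farthest from the ID labels.
    order = np.argsort(closeness, kind="stable")
    return order[:num_keep]

# Toy 2-d embeddings (hypothetical): one ID label and four candidate words.
id_embs = np.array([[1.0, 0.0]])
candidate_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

selected = mine_negative_labels(candidate_embs, id_embs, keep_ratio=0.5)
print(selected)
```

Candidates that nearly coincide with an ID label (the first two rows here) rank last and are left out, which corresponds to the dashed squares in the figure.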