Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

Zhixia He, Chen Zhao, Minglai Shao, Xintao Wu, Xujiang Zhao, Dong Li, Qin Tian, Linlin Yu

TL;DR

This paper tackles out-of-distribution detection in open-world settings by leveraging vision-language models with a novel Positive and Negative Prompt Supervision (PNPS) framework. The method constructs class-specific positive and negative prompts via large language models, optimizes them with learnable textual and visual mappings, and uses a cross-modal graph to propagate semantic supervision from prompts to image features. Key contributions include a set of losses (PIR, PPD, NIR, NND, NPD) that align prompts with images and enforce inter-class boundaries, and a heterogeneous graph with ViG that fuses textual priors into the visual pathway. The framework achieves state-of-the-art AUROC on CIFAR-100 and ImageNet-1K across multiple OOD benchmarks and LLMs. The approach demonstrates that prompt-based semantic enrichment, when coupled with graph-based cross-modal propagation, offers robust improvements for energy-based OOD detectors and practical benefits for open-world vision tasks.

Abstract

Out-of-distribution (OOD) detection aims to delineate the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may lead to suboptimal results because they capture overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
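
The abstract refers to an energy-based OOD detector without giving its form in this excerpt. As a minimal sketch, assuming the commonly used free-energy score $E(x) = -T \log \sum_k \exp(f_k(x)/T)$ over class logits $f(x)$, the scoring step could look like the following. The function names, the temperature default, and the threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T).

    Assumption: lower energy suggests an in-distribution input,
    higher energy suggests an out-of-distribution input.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    # Subtract the per-row maximum for a numerically stable logsumexp.
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def is_ood(logits, threshold, temperature=1.0):
    """Flag inputs whose energy exceeds a (hypothetical) validation-chosen threshold."""
    return energy_score(logits, temperature) > threshold

if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]   # peaked logits, typical of a clear ID image
    uncertain = [0.0, 0.0, 0.0]    # flat logits, typical of an unfamiliar image
    print(energy_score(confident), energy_score(uncertain))
```

In practice the threshold is usually set on held-out ID data, for example so that 95% of ID samples fall below it, which is the operating point behind the FPR95 metric.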

Paper Structure

This paper contains 26 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Illustration of the CLIP score using prompts and an image of a tiger. Compared to the prompt "a photo of a tiger", positive prompts enriched with visual features significantly enhance the CLIP score. Furthermore, the introduction of negative features can further increase the score.
  • Figure 2: An overview of the PNPS framework. To begin with, we employ LLMs to generate discriminative features, which are then filled into templates to construct class-specific positive and negative prompts. These prompts, together with image patches and full images, are subsequently encoded into their respective representations. To enhance the expressiveness of the prompt representations, we introduce learnable textual and visual parameter matrices, $\textbf{W}^T$ and $\textbf{W}^I$, for further optimization. Building on these optimized textual representations, we then construct cross-modal graph connections to aggregate semantic supervision from the prompts and propagate it to the visual branch, thereby improving the performance of image OOD detection.
  • Figure 3: t-SNE visualization of optimized image features, along with positive and negative prompt features, on the CIFAR-10 dataset in the shared semantic space.
  • Figure 4: Performance in terms of AUROC, AUPR, and FPR95 under different settings on the CIFAR-100 dataset.
  • Figure 5: Grad-CAM visualization of the features captured with full prompts, without positive prompts, and without negative prompts.
  • ...and 1 more figure
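
The Figure 1 and Figure 2 captions describe scoring an image against positive and negative prompts, with learnable matrices $\textbf{W}^T$ and $\textbf{W}^I$ refining the representations. This excerpt does not say how the matrices are applied, so the sketch below assumes they act as plain linear projections followed by CLIP-style cosine similarity. All function and variable names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each feature vector to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def prompt_similarities(img_feat, pos_feats, neg_feats, W_I, W_T):
    """Cosine similarity of one image to each positive and negative prompt.

    img_feat:  (d,)   image embedding
    pos_feats: (C, d) one positive prompt embedding per class
    neg_feats: (C, d) one negative prompt embedding per class
    W_I, W_T:  (d, d) assumed linear maps for the visual and textual branches
    """
    v = l2_normalize(img_feat @ W_I)
    p = l2_normalize(pos_feats @ W_T)
    n = l2_normalize(neg_feats @ W_T)
    return p @ v, n @ v

if __name__ == "__main__":
    d = 4
    pos = np.eye(3, d)               # toy positive prompt embeddings
    neg = -pos                       # toy negative prompt embeddings
    img = np.array([1.0, 0.0, 0.0, 0.0])
    pos_sim, neg_sim = prompt_similarities(img, pos, neg, np.eye(d), np.eye(d))
    print(pos_sim, neg_sim)
```

With identity projections this reduces to ordinary CLIP-style matching. In a trained model the two matrices would be optimized with the paper's losses, which are not reproduced here.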