LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Yabin Zhang; Wenjie Zhu; Chenhang He; Lei Zhang

TL;DR

The paper addresses out-of-distribution (OOD) detection with Vision-Language Models (VLMs), targeting the bottleneck of manual prompt engineering. It introduces Label-driven Automated Prompt Tuning (LAPT), which learns distribution-aware prompts using only in-distribution (ID) class names and automatically mined negative labels; the training images are collected automatically via text-to-image generation and web-based retrieval. The training objective combines a vanilla cross-entropy loss $\mathcal{L}$ with a cross-modal mixing term $\mathcal{L}_{cm}$ and a cross-distribution mixing term $\mathcal{L}_{cd}$: $\mathcal{L}_{all} = \mathcal{L} + \mathcal{L}_{cm} + \mathcal{L}_{cd}$. LAPT achieves state-of-the-art performance on OpenOOD benchmarks, especially in near-OOD scenarios, while also improving ID accuracy and robustness to covariate shifts. Code and resources are provided to facilitate practical adoption and reproducibility.

Abstract

Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demands domain expertise and is sensitive to linguistic nuances. In this paper, we introduce Label-driven Automated Prompt Tuning (LAPT), a novel approach to OOD detection that reduces the need for manual prompt engineering. We develop distribution-aware prompts with in-distribution (ID) class names and negative labels mined automatically. Training samples linked to these class labels are collected autonomously via image synthesis and retrieval methods, allowing for prompt learning without manual effort. We utilize a simple cross-entropy loss for prompt optimization, with cross-modal and cross-distribution mixing strategies to reduce image noise and explore the intermediate space between distributions, respectively. The LAPT framework operates autonomously, requiring only ID class names as input and eliminating the need for manual intervention. With extensive experiments, LAPT consistently outperforms manually crafted prompts, setting a new standard for OOD detection. Moreover, LAPT not only enhances the distinction between ID and OOD samples, but also improves the ID classification accuracy and strengthens the generalization robustness to covariate shifts, resulting in outstanding performance in challenging full-spectrum OOD detection tasks. Codes are available at \url{https://github.com/YBZh/LAPT}.
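The objective described above (a cross-entropy loss plus cross-modal and cross-distribution mixing) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mixup-style interpolation, the fixed mixing weights `lam_cm` and `lam_cd`, the temperature value, the pairing of ID with negative-label samples, and all function names are assumptions made for illustration. In actual prompt tuning the text features would come from a text encoder applied to learnable prompts; here they are fixed arrays.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is usual for CLIP-style features."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def soft_cross_entropy(features, text_feats, soft_labels, temperature=0.01):
    """Cross-entropy between soft labels and softmax over cosine similarities."""
    logits = l2_normalize(features) @ l2_normalize(text_feats).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(soft_labels * log_probs).sum(axis=1).mean())

def lapt_style_loss(image_feats, labels, text_feats, num_id,
                    lam_cm=0.5, lam_cd=0.5, rng=None):
    """Hypothetical sketch of L_all = L + L_cm + L_cd.

    Classes [0, num_id) are ID classes; classes >= num_id are negative labels.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    v = l2_normalize(image_feats)
    c = l2_normalize(text_feats)
    one_hot = np.eye(text_feats.shape[0])[labels]

    # L: plain cross-entropy on the (possibly noisy) collected images.
    loss_ce = soft_cross_entropy(v, c, one_hot)

    # L_cm (assumed form): blend each image feature with the text feature of
    # its own class, so a noisy image is pulled toward its label; label kept.
    mixed_cm = lam_cm * v + (1.0 - lam_cm) * c[labels]
    loss_cm = soft_cross_entropy(mixed_cm, c, one_hot)

    # L_cd (assumed form): blend an ID sample with a negative-label sample and
    # blend their labels too, covering the space between the two distributions.
    id_idx = np.flatnonzero(labels < num_id)
    neg_idx = np.flatnonzero(labels >= num_id)
    n_pairs = min(len(id_idx), len(neg_idx))
    if n_pairs == 0:
        loss_cd = 0.0
    else:
        a = rng.permutation(id_idx)[:n_pairs]
        b = rng.permutation(neg_idx)[:n_pairs]
        mixed_cd = lam_cd * v[a] + (1.0 - lam_cd) * v[b]
        mixed_labels = lam_cd * one_hot[a] + (1.0 - lam_cd) * one_hot[b]
        loss_cd = soft_cross_entropy(mixed_cd, c, mixed_labels)

    return loss_ce + loss_cm + loss_cd, (loss_ce, loss_cm, loss_cd)

# Toy usage: 3 ID classes, 3 negative labels, 8-dimensional random features.
demo_rng = np.random.default_rng(1)
demo_text = demo_rng.normal(size=(6, 8))
demo_labels = np.array([0, 1, 2, 3, 4, 5])
demo_images = demo_text[demo_labels] + 0.01 * demo_rng.normal(size=(6, 8))
demo_total, demo_parts = lapt_style_loss(demo_images, demo_labels, demo_text, num_id=3)
```

In this sketch the cross-distribution term can never fall below the entropy of the mixed label (about 0.69 for an even blend), so it acts as a regularizer on the region between ID and negative-label features and is not a term that can be driven to zero.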

Paper Structure (14 sections, 12 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methods
Problem Setup
Reviews on MCM and NegLabel
LAPT: Label-driven Automated Prompt Tuning
Distribution-aware prompts.
Automated sample collection with class labels.
Prompt tuning with cross-modal and cross-distribution mixing.
Experiments
Setup
Main Results
Analyses and Discussions
Conclusion and Limitations

Figures (5)

Figure 1: VLM-based OOD detection systems are sensitive to linguistic nuances of text prompts. Results are reported on the OpenOOD dataset with a ViT-B/16 image encoder. A lower FPR95 denotes better OOD detection performance.
Figure 2: The overall framework of our LAPT method, where $\boldsymbol{v}_{dog}/\boldsymbol{v}_{boat}$, $\boldsymbol{c}_{dog}/\boldsymbol{c}_{boat}$, and $\boldsymbol{l}_{dog}/\boldsymbol{l}_{boat}$ are image features, textual features, and soft labels of dog/boat samples.
Figure 3: Visualization of retrieved images (left), generated images (right), and the statistics of their cosine similarity to the class labels (middle). A larger mean indicates greater image-label consistency, and a larger standard deviation implies more diversity.
Figure 4: Analyses on prompt (a) types, (b) length, (c) initializations, and (d) positions.
Figure 5: Analyses on (a) image collection methods, (b) text-to-image generators, (c) data scale of the WebData space, and (d) number of training samples per class.
