Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection

Pei-Kang Lee, Jun-Cheng Chen, Ja-Ling Wu

TL;DR

This work tackles zero-shot and few-shot out-of-distribution (OOD) detection by combining large language models (LLMs) and vision–language models (VLMs). It introduces a hierarchical strategy that uses LLM-generated superclass labels and background descriptions (Superclass-BG) to expand and refine the ID semantic space, so that more representative negative labels can be drawn from WordNet. A two-phase training regime, prompt tuning (Phase-1) followed by visual prompt tuning (Phase-2), adapts CLIP to the target distribution and uses ID-like OOD data for few-shot learning. Empirical results show consistent improvements over state-of-the-art methods on the ImageNet-1K and OpenOOD v1.5 benchmarks, with AUROC gains of up to 2.9% and FPR95 reductions of up to 12.6%, plus strong robustness to covariate shift. The paper also analyzes how background description length and the choice of LLM affect performance, offering practical guidance for deploying zero-shot and few-shot OOD detection in real-world settings.

Abstract

Out-of-distribution (OOD) detection has seen significant advancements with zero-shot approaches that leverage powerful Vision-Language Models (VLMs) such as CLIP. However, prior work has predominantly focused on enhancing Far-OOD performance while potentially compromising Near-OOD efficacy, as observed in our pilot study. To address this issue, we propose a novel strategy to enhance zero-shot OOD detection performance in both Far-OOD and Near-OOD scenarios by harnessing Large Language Models (LLMs) and VLMs. Our approach first exploits an LLM to generate superclasses of the ID labels and their corresponding background descriptions, followed by feature extraction using CLIP. We then isolate the core semantic features of the ID data by subtracting the background features from the superclass features. The refined representation facilitates the selection of more appropriate negative labels for OOD data from a comprehensive candidate label set drawn from WordNet, thereby enhancing the performance of zero-shot OOD detection in both scenarios. Furthermore, we introduce novel few-shot prompt tuning and visual prompt tuning to adapt the proposed framework to the target distribution. Experimental results demonstrate that the proposed approach consistently outperforms current state-of-the-art methods across multiple benchmarks, with an improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95. Additionally, our method exhibits superior robustness against covariate shift across different domains, further highlighting its effectiveness in real-world scenarios.
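
The feature-refinement and negative-label selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: random vectors stand in for CLIP text embeddings, the function names `refine_id_features` and `select_negative_labels` are hypothetical, one averaged background feature per superclass is assumed, and the selection rule (keep the candidates with the lowest maximum cosine similarity to the refined ID features) is an assumption, since the abstract does not specify the exact criterion.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against division by zero."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def refine_id_features(superclass_feats, background_feats):
    """Subtract background features from superclass features, then renormalize.

    Both inputs have shape (n_superclasses, dim). Assumption: each row of
    background_feats is the background embedding paired with that superclass.
    """
    return l2_normalize(superclass_feats - background_feats)

def select_negative_labels(refined_id_feats, candidate_feats, k):
    """Return indices of the k candidates least similar to the refined ID features.

    Assumption: "least similar" means lowest maximum cosine similarity to any
    refined ID feature. The paper's actual selection rule may differ.
    """
    sims = l2_normalize(candidate_feats) @ refined_id_feats.T  # (n_candidates, n_id)
    closeness = sims.max(axis=1)
    return np.argsort(closeness)[:k]

# Stand-in embeddings (hypothetical data; a real pipeline would use a CLIP text encoder).
rng = np.random.default_rng(0)
dim = 16
superclass_feats = l2_normalize(rng.normal(size=(3, dim)))
background_feats = l2_normalize(rng.normal(size=(3, dim)))
candidate_feats = l2_normalize(rng.normal(size=(50, dim)))  # e.g. WordNet nouns

refined = refine_id_features(superclass_feats, background_feats)
neg_idx = select_negative_labels(refined, candidate_feats, k=10)
print(neg_idx)
```

In a real setting, `candidate_feats` would hold embeddings of WordNet candidate labels, and the indices in `neg_idx` would map back to the words used as negative labels.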

Paper Structure (28 sections, 11 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of the proposed zero-shot OOD detection and Superclass-BG negative label selection. We harness the capabilities of LLMs to select more representative negative labels.
  • Figure 2: Illustration of the proposed few-shot learning framework. The training dataset consists of a few-shot sample from the ImageNet-1K training set, while the OOD dataset is generated using the ID-like approach [bai2024idlike]. Here, the term "positive labels" refers to the class labels with learnable prompts.
  • Figure 3: Superclass-BG zero-shot performance metrics as a function of description length. Each subplot shows the trend of a specific OOD detection metric for description lengths ranging from 1 to 4. Data points represent average scores computed from 5-15 descriptions. The brown dashed line indicates the performance of NegLabel [jiang2024neglabel].
  • Figure 4: Performance comparison on the OpenOOD v1.5 benchmark [zhang2023openood, yang2022fsood]. The evaluation contrasts the effectiveness of positive labels enhanced by original class labels against the baseline without enhancement.
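
The results summarized above are reported in AUROC and FPR95. As a reminder of what these two metrics measure, here is a small sketch using their standard definitions, assuming that higher scores mean "more likely ID". The function names and the toy scores are illustrative only and are not taken from the paper.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as one half. This pairwise form is quadratic in memory,
    so it is only meant for small illustrative inputs.
    """
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((id_col > ood_row) + 0.5 * (id_col == ood_row)))

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold that keeps
    roughly 95% of ID samples (the 5th percentile of the ID scores)."""
    threshold = np.percentile(np.asarray(id_scores, dtype=float), 5)
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= threshold))

# Toy scores (hypothetical): ID samples tend to score higher than OOD samples.
rng = np.random.default_rng(0)
id_scores = rng.normal(loc=2.0, scale=1.0, size=1000)
ood_scores = rng.normal(loc=0.0, scale=1.0, size=1000)

print("AUROC:", auroc(id_scores, ood_scores))
print("FPR95:", fpr_at_95_tpr(id_scores, ood_scores))
```

A better detector pushes AUROC toward 1 and FPR95 toward 0, which is why the paper reports gains in the former and reductions in the latter.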