Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection
Pei-Kang Lee, Jun-Cheng Chen, Ja-Ling Wu
TL;DR
This work tackles zero-shot and few-shot out-of-distribution (OOD) detection by combining large language models (LLMs) with vision-language models (VLMs). It introduces a hierarchical strategy, Superclass-BG, in which an LLM generates superclass labels and background descriptions for the in-distribution (ID) classes. These refine the ID semantic space so that more representative negative labels can be drawn from WordNet. A two-phase training regime, prompt tuning (Phase-1) followed by visual prompt tuning (Phase-2), adapts CLIP to the target distribution, with ID-like OOD data supporting robust few-shot learning. Empirical results show consistent improvements over state-of-the-art methods on the ImageNet-1K and OpenOOD v1.5 benchmarks, with AUROC gains of up to 2.9% and FPR95 reductions of up to 12.6%, along with strong robustness to covariate shift. The work also examines the effect of background description length and LLM choice, offering practical guidance for deploying zero-shot and few-shot OOD detection in real-world settings.
Abstract
Out-of-distribution (OOD) detection has advanced significantly with zero-shot approaches that leverage powerful Vision-Language Models (VLMs) such as CLIP. However, our pilot study shows that prior work has predominantly focused on improving Far-OOD performance, potentially at the expense of Near-OOD performance. To address this issue, we propose a novel strategy that improves zero-shot OOD detection in both Far-OOD and Near-OOD scenarios by harnessing Large Language Models (LLMs) together with VLMs. Our approach first uses an LLM to generate superclasses of the in-distribution (ID) labels and their corresponding background descriptions, and then extracts features for both using CLIP. We then isolate the core semantic features of the ID data by subtracting the background features from the superclass features. This refined representation enables the selection of more appropriate negative labels for OOD data from a comprehensive candidate label set drawn from WordNet, thereby improving zero-shot OOD detection in both scenarios. Furthermore, we introduce few-shot prompt tuning and visual prompt tuning to better align the proposed framework with the target distribution. Experimental results demonstrate that the proposed approach consistently outperforms current state-of-the-art methods across multiple benchmarks, with an improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95. Additionally, our method exhibits superior robustness to covariate shift across different domains, further highlighting its effectiveness in real-world scenarios.
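
The zero-shot pipeline described above (subtract background features from superclass features, then pick negative labels from a candidate pool) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Random vectors stand in for CLIP text and image embeddings, and the function names, the selection rule (keep the candidates least similar to the refined ID features), the softmax-based ID score, and the temperature value are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def core_features(superclass_emb, background_emb):
    """Subtract background features from superclass features, then renormalize."""
    return l2_normalize(superclass_emb - background_emb)


def select_negative_labels(candidate_emb, core_emb, k):
    """Return indices of the k candidates least similar to any refined ID feature.

    Assumed selection rule: rank each candidate by its maximum cosine
    similarity to the core features and keep the k lowest.
    """
    sim = l2_normalize(candidate_emb) @ core_emb.T  # (n_candidates, n_id)
    closeness = sim.max(axis=1)
    return np.argsort(closeness)[:k]


def id_score(image_emb, id_emb, neg_emb, temperature=0.01):
    """Assumed score: softmax mass on ID labels versus ID plus negative labels."""
    image_emb = l2_normalize(image_emb)
    labels = np.concatenate([l2_normalize(id_emb), l2_normalize(neg_emb)], axis=0)
    logits = (image_emb @ labels.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, : id_emb.shape[0]].sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 64
    # Hypothetical stand-ins for CLIP embeddings of LLM-generated text.
    superclass = rng.normal(size=(3, dim))
    background = rng.normal(size=(3, dim))
    id_labels = rng.normal(size=(3, dim))
    candidates = rng.normal(size=(50, dim))  # e.g. WordNet nouns

    core = core_features(superclass, background)
    neg_idx = select_negative_labels(candidates, core, k=10)
    images = rng.normal(size=(4, dim))
    print(id_score(images, id_labels, candidates[neg_idx]))
```

A higher score means the image places more of its similarity mass on ID labels than on negative labels, so thresholding the score yields an ID/OOD decision. In practice the embeddings would come from a CLIP text and image encoder rather than a random generator.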
