ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang

TL;DR

This paper tackles OOD detection by addressing three major gaps in negative-label methods: lack of understanding of OOD images, poor near-OOD handling, and dependence on predefined task settings. It introduces ANTS, a training-free, zero-shot framework that uses test-time multimodal LLM reasoning to shape an adaptive negative textual space, comprising Expressive Negative Sentences (ENS) from mined negative images and Visually Similar Negative Labels (VSNL) for ID-class subsets near OOD. The two text spaces are balanced by an adaptive score, $S_{ada}({\bm{v}}) = \lambda S_{ens}({\bm{v}}) + (1-\lambda) S_{vsnl}({\bm{v}})$, with $\lambda$ determined by dataset- and data-driven expectations via $\lambda = F( \frac{1}{|\mathcal{X}_{neg}|} \sum S_{ens}, \frac{1}{|\mathcal{X}_{neg}|} \sum S_{vsnl} )$ and $F(a,b) = \frac{1-a}{(1-a)+(1-b)}$. Comprehensive experiments on ImageNet-based benchmarks show that ANTS achieves new state-of-the-art results in both near-OOD and far-OOD settings, while remaining zero-shot and training-free. The approach demonstrates strong scalability and robustness, though it relies on MLLM inference during testing, which has memory implications. Overall, ANTS offers a practical, adaptable strategy for open-environment OOD detection without requiring auxiliary outlier data.
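The weighting rule in the TL;DR can be made concrete with a small sketch. It follows the two formulas quoted above, $S_{ada} = \lambda S_{ens} + (1-\lambda) S_{vsnl}$ and $F(a,b) = \frac{1-a}{(1-a)+(1-b)}$, and nothing more. The function names, the assumption that scores lie in $[0, 1]$, and the equal-weight fallback when the denominator vanishes are illustrative choices, not details taken from the paper.

```python
from typing import Sequence


def balance_weight(mean_ens: float, mean_vsnl: float) -> float:
    """F(a, b) = (1 - a) / ((1 - a) + (1 - b)), as quoted in the TL;DR."""
    denom = (1.0 - mean_ens) + (1.0 - mean_vsnl)
    if denom == 0.0:
        # Assumption (not from the paper): weight both spaces equally
        # when both mean scores are exactly 1 and F is undefined.
        return 0.5
    return (1.0 - mean_ens) / denom


def adaptive_lambda(ens_scores: Sequence[float],
                    vsnl_scores: Sequence[float]) -> float:
    """Compute lambda from per-image scores on the cached negative images."""
    if len(ens_scores) == 0 or len(ens_scores) != len(vsnl_scores):
        raise ValueError("need two non-empty score lists of equal length")
    mean_ens = sum(ens_scores) / len(ens_scores)
    mean_vsnl = sum(vsnl_scores) / len(vsnl_scores)
    return balance_weight(mean_ens, mean_vsnl)


def adaptive_score(s_ens: float, s_vsnl: float, lam: float) -> float:
    """S_ada(v) = lambda * S_ens(v) + (1 - lambda) * S_vsnl(v)."""
    return lam * s_ens + (1.0 - lam) * s_vsnl


if __name__ == "__main__":
    # Hypothetical scores for two cached negative images.
    lam = adaptive_lambda([0.2, 0.4], [0.6, 0.8])
    print(f"lambda = {lam:.3f}")
    print(f"S_ada  = {adaptive_score(0.25, 0.75, lam):.3f}")
```

Under this reading of $F$, the textual space whose mean score on the cached negatives is lower receives the larger weight: as $a$ decreases, $1-a$ grows and so does $\lambda$.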

Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.

Paper Structure

This paper contains 13 sections, 14 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: t-SNE visualization of the ID and OOD image features, together with the text features of NegLabel [jiang2023detecting], EOE [cao2024envisioning], the OOD ground truth, and the expressive negative sentences (ENS) of ANTS. We select ImageNet and SUN as the ID and OOD datasets, respectively. NegLabel and EOE lack a good understanding of OOD images, resulting in a greater distance between the OOD images and their text features. In contrast, ANTS uses an MLLM to understand OOD images during ENS generation, reducing the distance between the ENS and the OOD images and improving OOD detection performance.
  • Figure 2: (a) Current MLLMs improve their reasoning abilities through test-time understanding and reasoning with chain-of-thought (CoT) prompting. (b) In our work, we leverage the test-time understanding and reasoning capabilities of an MLLM during inference to help vision-language models perform better on OOD detection.
  • Figure 3: The overall framework of ANTS. The framework consists of three stages: (1) caching negative images and visually similar ID classes mined from historical test images; (2) shaping two negative textual spaces by prompting an MLLM with the cached data to generate expressive negative sentences and visually similar negative labels; and (3) performing online evaluation of the test image using an adaptively weighted combination of these textual spaces.
  • Figure 4: Expressive Negative Sentences, where $y_{i}$ represents the predicted ID label of the negative image.
  • Figure 5: Visually Similar Negative Labels, where $y_{i}$ represents the predicted ID label of the negative image.
  • ...and 2 more figures