ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
TL;DR
This paper addresses three gaps in negative-label methods for OOD detection: limited understanding of what OOD images look like, weak near-OOD performance, and dependence on a predefined task setting. It introduces ANTS, a training-free, zero-shot framework that uses test-time multimodal LLM (MLLM) understanding and reasoning to shape an adaptive negative textual space. The space has two parts: Expressive Negative Sentences (ENS), which the MLLM generates by describing negative images mined from historical test images, and Visually Similar Negative Labels (VSNL), which the MLLM generates for the subset of ID classes that are visually similar to historical test images. The two parts are combined in an adaptive score, $S_{ada}({\bm{v}}) = \lambda S_{ens}({\bm{v}}) + (1-\lambda) S_{vsnl}({\bm{v}})$, where $\lambda$ is computed from the mean of each score over the mined negative images $\mathcal{X}_{neg}$: $\lambda = F\left( \frac{1}{|\mathcal{X}_{neg}|} \sum_{{\bm{v}} \in \mathcal{X}_{neg}} S_{ens}({\bm{v}}), \frac{1}{|\mathcal{X}_{neg}|} \sum_{{\bm{v}} \in \mathcal{X}_{neg}} S_{vsnl}({\bm{v}}) \right)$ with $F(a,b) = \frac{1-a}{(1-a)+(1-b)}$. Experiments on ImageNet-based benchmarks report new state-of-the-art results in both near-OOD and far-OOD settings while the method remains zero-shot and training-free. The approach is reported to scale well and to be robust, though it relies on MLLM inference at test time, which adds memory overhead. Because it requires no auxiliary outlier data, ANTS offers a practical, adaptable strategy for OOD detection in open environments.
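The weighting rule above can be illustrated with a minimal sketch. It only transcribes the two formulas from the TL;DR; the function names `balance_weight` and `adaptive_score`, the assumption that both scores lie in $[0, 1]$, and the fallback to $\lambda = 0.5$ when the denominator vanishes are illustrative choices, not details taken from the paper.

```python
import numpy as np


def balance_weight(mean_ens: float, mean_vsnl: float, eps: float = 1e-12) -> float:
    """F(a, b) = (1 - a) / ((1 - a) + (1 - b)), as given in the TL;DR."""
    num = 1.0 - mean_ens
    den = (1.0 - mean_ens) + (1.0 - mean_vsnl)
    if abs(den) < eps:
        # Assumption (not from the paper): weight both spaces equally
        # when both mean scores equal 1 and F is undefined.
        return 0.5
    return num / den


def adaptive_score(s_ens, s_vsnl, neg_s_ens, neg_s_vsnl):
    """Return (S_ada, lambda) for a batch of test images.

    s_ens, s_vsnl         : per-image ENS and VSNL scores of the test images.
    neg_s_ens, neg_s_vsnl : the same two scores on the mined negative images X_neg.
    """
    lam = balance_weight(float(np.mean(neg_s_ens)), float(np.mean(neg_s_vsnl)))
    s_ada = lam * np.asarray(s_ens, dtype=float) + (1.0 - lam) * np.asarray(s_vsnl, dtype=float)
    return s_ada, lam


if __name__ == "__main__":
    scores, lam = adaptive_score(
        s_ens=[0.9], s_vsnl=[0.3],
        neg_s_ens=[0.2, 0.2], neg_s_vsnl=[0.6, 0.6],
    )
    print(lam, scores)
```

Under these assumptions, the text space whose mean score on the mined negatives is lower receives the larger weight: with means 0.2 (ENS) and 0.6 (VSNL), $\lambda = 0.8 / 1.2 = 2/3$.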
Abstract
The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels that are semantically similar to in-distribution (ID) labels limits their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images from the historical test images that are likely to be OOD samples and prompt the MLLM to describe them, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble a subset of the ID classes, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, ANTS reduces FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
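For readers unfamiliar with the metric quoted above, FPR95 is the false positive rate on OOD samples at the threshold where 95% of ID samples are accepted. The sketch below shows one common way to compute it; the function name `fpr_at_95_tpr`, the convention that a higher score means "more likely ID", and the use of a percentile to pick the threshold are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np


def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """False positive rate on OOD data at (approximately) 95% true positive rate on ID data.

    Assumes higher score = more likely in-distribution.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold below which only about 5% of ID scores fall,
    # so about 95% of ID samples are accepted.
    threshold = np.percentile(id_scores, 5)
    # OOD samples scoring at or above the threshold are wrongly accepted as ID.
    return float(np.mean(ood_scores >= threshold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    id_s = rng.normal(loc=0.8, scale=0.1, size=1000)   # synthetic ID scores
    ood_s = rng.normal(loc=0.5, scale=0.1, size=1000)  # synthetic OOD scores
    print(f"FPR95 on synthetic scores: {fpr_at_95_tpr(id_s, ood_s):.3f}")
```

Lower is better: a value of 0 means no OOD sample passes the threshold that keeps 95% of ID samples.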
