Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang

TL;DR

InterNeg is a simple yet effective framework that systematically applies consistent inter-modal distance enhancement from both textual and visual perspectives, achieving state-of-the-art performance compared to existing works.

Abstract

Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm is inherently inconsistent with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
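
To make the textual perspective concrete, here is a minimal NumPy sketch of what an inter-modal selection criterion and a negative-text-based score could look like. It is not the authors' implementation: the function names, the use of mean text-to-image cosine similarity as the criterion, the summed softmax mass over ID labels as the score, and the temperature value are all assumptions made for illustration. The only point it is meant to convey is that candidate negative texts are ranked against ID image embeddings (inter-modal) rather than against ID label embeddings (intra-modal).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def select_negative_texts(cand_text_emb, id_image_emb, m):
    """Return indices of the m candidate texts farthest from the ID images.

    Hypothetical criterion: mean cosine similarity between each candidate
    text embedding and a set of ID image embeddings (text-to-image distance).
    """
    sim = l2_normalize(cand_text_emb) @ l2_normalize(id_image_emb).T  # (C, N)
    closeness = sim.mean(axis=1)
    # Lowest mean similarity first, i.e. farthest from the ID images.
    return np.argsort(closeness)[:m]

def id_score(image_emb, id_text_emb, neg_text_emb, temperature=0.01):
    """Softmax mass assigned to ID labels among ID labels + negative texts.

    Assumed scoring rule: a high value suggests ID, a low value suggests OOD.
    """
    texts = np.concatenate([id_text_emb, neg_text_emb], axis=0)
    logits = l2_normalize(image_emb) @ l2_normalize(texts).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, : len(id_text_emb)].sum(axis=1)
```

In a real pipeline the embeddings would come from a CLIP-like text and image encoder; here they are plain arrays of shape (count, dimension), so the sketch runs without any model weights.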

Paper Structure (46 sections, 12 equations, 5 figures, 13 tables, 2 algorithms)

Figures (5)

  • Figure 1: Comparison of Baseline and InterNeg. The baseline often incorporates intra-modal distance during OOD detection, which is inconsistent with the inter-modal distance that CLIP-like VLMs are optimized for. In contrast, InterNeg leverages consistent inter-modal distance during OOD detection, enhancing performance by inter-modal guided negative texts and extra negative text embeddings generated through modality inversion.
  • Figure 2: Two types of ID misclassification. First Row: Max-OOD dominant ID misclassification. Second Row: Sum-OOD dominant ID misclassification. Left: Original ID image from ImageNet-1K with its class label and filename. Middle: Top-5 softmax scores for ID labels and negative texts of baseline and our method. Right: Max-OOD/Sum-OOD dominant ID error rates under different thresholds $\gamma$ of baseline and our method.
  • Figure 3: AUROC $\uparrow$ and FPR95 $\downarrow$ average performance under varying ID:OOD ratios.
  • Figure 4: Parameter sensitivity analysis of four key hyperparameters: the number of ID images per class $N$, the number of selected negative texts $M$, the maximum size of extra negative text embeddings set $K$, and the high-confidence OOD threshold $\beta$, evaluated on both Four-OOD and Near-OOD benchmarks using ImageNet-1K as the ID dataset.
  • Figure 5: Max-OOD and Sum-OOD ID error rates on different OOD datasets. Left: iNaturalist. Middle: Places. Right: Textures.
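
The two metrics reported in the abstract and in Figure 3 can be computed from per-sample scores as sketched below. This assumes the common convention that ID is the positive class and that a higher score means "more likely ID"; the function names are illustrative, and the percentile-based threshold is only an approximation of the 95% true-positive rate on small samples.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when about 95% of ID samples are accepted."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold below which roughly 5% of the ID scores fall.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = id_scores[:, None] - ood_scores[None, :]
    # Ties count as half a correct ranking.
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
```

The pairwise AUROC above uses memory proportional to the product of the two sample counts, so it suits small checks; for large benchmarks a rank-based routine is the usual choice.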