TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

Jinglun Li, Xinyu Zhou, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Zhaoyu Chen, Weifeng Ge, Wenqiang Zhang

TL;DR

TagOOD addresses out-of-distribution (OOD) detection by using vision-language tagging to decouple object features from whole-image features without requiring labels, then learning object-level in-distribution (IND) class centers in a common feature space via a lightweight projection model. The OOD score is computed as a cosine-based distance between test features and the learned centers, which allows discrimination even when backgrounds or similar objects confound IND representations. The approach combines image feature decomposition with class centers that are updated by an exponential moving average (EMA) and trained through a joint cross-entropy (CE) and mean-squared-error (MSE) loss. It reports strong AUROC and FPR95 results on ImageNet-1K and several challenging OOD benchmarks, and remains robust to feature-selection choices and distance metrics. This work highlights the value of multimodal information fusion for reliable OOD detection and suggests avenues for applying vision-language representations to related tasks.
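
The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ood_score`, the use of NumPy, and the choice of "1 minus the maximum cosine similarity to any center" are assumptions made here, and the paper's exact score (for example, any use of the hyperparameter $\tau$) may differ.

```python
import numpy as np

def ood_score(features, centers, eps=1e-12):
    """Cosine distance from each feature to its nearest class center.

    features: (n, d) array of projected test features (assumed given).
    centers:  (k, d) array of learned IND class centers (assumed given).
    Returns an (n,) array; larger values suggest an OOD sample.
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    cos_sim = f @ c.T  # shape (n, k)
    # Distance to the closest center (hypothetical scoring rule).
    return 1.0 - cos_sim.max(axis=1)

# Toy data: two centers on the axes, one aligned sample and one far sample.
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
tests = np.array([[2.0, 0.0], [-1.0, -1.0]])
scores = ood_score(tests, centers)
```

In this toy setup the first sample points in the same direction as a center and receives a score near zero, while the second points away from both centers and receives a much larger score. A threshold on the score would then separate IND from OOD samples.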

Abstract

Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose TagOOD, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of in-distribution (IND) object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.
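
To make the idea of learned class centers concrete, the sketch below shows one way an exponential-moving-average update of per-class centers could look. It is a hedged illustration only: the function name `ema_update_centers`, the `momentum` value, and the per-batch class-mean update are assumptions introduced here, and the joint loss used to train the projection network is not shown.

```python
import numpy as np

def ema_update_centers(centers, features, labels, momentum=0.9):
    """Return a copy of `centers` after one EMA step for each class in the batch.

    centers:  (k, d) array of current class centers.
    features: (n, d) array of object features from one batch (assumed given).
    labels:   (n,) integer array of IND class indices in [0, k).
    momentum: weight kept on the old center (hypothetical default).
    """
    new_centers = centers.copy()
    for cls in np.unique(labels):
        # Mean of the batch features that belong to this class.
        batch_mean = features[labels == cls].mean(axis=0)
        new_centers[cls] = momentum * centers[cls] + (1.0 - momentum) * batch_mean
    return new_centers

# Toy batch: two samples of class 0 and one sample of class 1.
centers = np.zeros((2, 2))
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
updated = ema_update_centers(centers, feats, labels, momentum=0.5)
```

Classes that do not appear in a batch keep their previous center, so each center drifts slowly toward the typical feature of its class rather than following any single batch.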

Paper Structure (17 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: A toy sample illustrates a challenge in OOD detection using a 3D feature space. The "Bamboo" and "Withered grass" features in the IND image are close to the OOD features in this space. This can cause the model to misclassify the IND image as containing objects similar to the OOD image.
  • Figure 2: The TagOOD pipeline for OOD detection consists of two main stages, illustrated in (a) and (b). First, image feature decomposition (see (a) on the left) leverages a vision-language model to generate multiple tags. The model then identifies tags belonging to the IND category and creates corresponding attention masks within the image. Next, as shown in (b), the extracted IND object features are used to train a lightweight network that produces a set of IND class centers. During inference (referring back to (a)), TagOOD computes a distance-based metric between IND class centers and test sample features as the OOD score.
  • Figure 3: The performance of TagOOD training with varying values of the hyperparameter $\tau$.
  • Figure 4: Visualization of object features and the OOD score distribution. Both sets of object features are visualized with t-SNE (van der Maaten and Hinton, 2008). Data points represent object features, and colors encode their corresponding IND class labels. Gray X marks indicate OOD data points. The features on the left are extracted directly from the tagging model before projection. After projection, the object features become more compact, as shown on the right.
  • Figure 5: Results of evaluation on varying levels of features selected during image feature decomposition.