Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Jeonghyeon Kim, Sangheum Hwang

TL;DR

This work tackles out-of-distribution detection (OoDD) in open-world settings by addressing the modality gap in multi-modal vision-language models (VLMs). It introduces Cross-Modal Alignment (CMA), a multi-modal fine-tuning (MMFT) objective that aligns image and text embeddings on a hypersphere and leverages pretrained textual knowledge through NegLabel-based scoring; the objective is also linked to energy-based models. Empirical results on the ImageNet-1k-based MOS and OpenOOD v1.5 benchmarks show state-of-the-art OoDD performance and competitive ID accuracy, with CMA reducing the modality gap and enhancing cross-modal alignment. The findings highlight the value of fully exploiting textual priors in VLMs for robust OoDD and provide a principled, hyperspherical perspective on multi-modal representation learning.

Abstract

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
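
The modality gap mentioned above can be made concrete with a small sketch. The snippet below is not the paper's code: it assumes one common way to quantify the gap, the Euclidean distance between the centroids of L2-normalized image and text embeddings, and reports the mean cosine similarity of matched image-text pairs as a simple alignment indicator. The function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Assumed gap measure: distance between the two modality centroids."""
    c_img = centroid([l2_normalize(v) for v in image_embs])
    c_txt = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


def mean_paired_cosine(image_embs, text_embs):
    """Mean cosine similarity between the i-th image and the i-th text."""
    total = 0.0
    for img, txt in zip(image_embs, text_embs):
        a, b = l2_normalize(img), l2_normalize(txt)
        total += sum(x * y for x, y in zip(a, b))
    return total / len(image_embs)


# Toy example (hypothetical 2-D embeddings): images and texts occupy
# separate regions before alignment and nearby regions after it.
images = [[1.0, 0.1], [1.0, -0.1]]
texts_before = [[0.1, 1.0], [-0.1, 1.0]]
texts_after = [[1.0, 0.2], [1.0, -0.2]]

print(modality_gap(images, texts_before), mean_paired_cosine(images, texts_before))
print(modality_gap(images, texts_after), mean_paired_cosine(images, texts_after))
```

In this toy setting the aligned texts give a smaller centroid distance and a higher paired cosine similarity, which is the qualitative effect the abstract attributes to regularizing image-text distances for ID data.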

Paper Structure

This paper contains 29 sections, 10 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) illustrates the hyperspherical embedding space and the corresponding cosine similarity values between the "dog" image and "A photo of a $<\textit{label}>$" text embeddings. Initially, the embedding space shows a bipartite separation between images and texts (top) [liang2022mind, oh2024geodesic, shi2023towards]. Through CMA, ID images and texts are brought closer together while maintaining a clear separation from OoD texts (bottom). This alignment enhances the discriminability of ID data from negative concepts (i.e., OoD labels), thereby improving OoDD performance. In (b), uncolored shapes represent MCM, while colored shapes denote NegLabel. The arrows indicate the effect of NegLabel compared to MCM, demonstrating that our method enhances its effectiveness. Points closer to the top right indicate better ID accuracy and OoDD performance.
  • Figure 2: Visualization of DOSNES [lu2019doubly] on ImageNet-1k validation dataset and the MOS benchmark dataset. Blue and orange represent ID image and ID text embeddings, respectively, while green and red represent OoD image and OoD text embeddings. Additional visualizations are provided in the Appendix.
  • Figure 3: PyTorch-like pseudo-code of CMA
  • Figure 4: Visualization of image and text embeddings using PCA on ImageNet-1k. Orange and blue points represent ID image and ID text embeddings, respectively.
  • Figure 5: Visualization of image and text embeddings using PCA on ImageNet-1k and negative texts. Orange and blue points represent ID image and ID text embeddings, respectively, while red points denote negative text embeddings.
  • ...and 1 more figure
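
The MCM and NegLabel scores compared in Figure 1(b) can be sketched as follows. This is a simplified illustration rather than either method's reference implementation: it assumes MCM is the maximum softmax probability over ID-label similarities, and that a NegLabel-style score is the share of softmax mass assigned to ID labels once negative labels are added to the label set. The function names, the temperature default, and the toy similarities are assumptions.

```python
import math


def mcm_score(id_sims, temperature=1.0):
    """Assumed MCM-style score: max softmax probability over ID labels."""
    weights = [math.exp(s / temperature) for s in id_sims]
    return max(weights) / sum(weights)


def neglabel_score(id_sims, neg_sims, temperature=1.0):
    """Assumed NegLabel-style score: softmax mass on ID labels when
    negative (OoD) labels compete with them."""
    id_mass = sum(math.exp(s / temperature) for s in id_sims)
    neg_mass = sum(math.exp(s / temperature) for s in neg_sims)
    return id_mass / (id_mass + neg_mass)


# Hypothetical cosine similarities between one image and each label text.
id_like = {"id": [0.9, 0.1], "neg": [0.1, 0.0]}   # image close to an ID label
ood_like = {"id": [0.1, 0.1], "neg": [0.9, 0.0]}  # image close to a negative label

for name, sims in (("ID-like", id_like), ("OoD-like", ood_like)):
    print(name, mcm_score(sims["id"]), neglabel_score(sims["id"], sims["neg"]))
```

A higher score indicates that the input looks more in-distribution. Under these assumptions, the score separates the two toy inputs only to the extent that ID images sit closer to ID texts than to negative texts, which is why the caption of Figure 1 ties better cross-modal alignment to a larger benefit from NegLabel.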