Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Jeonghyeon Kim, Sangheum Hwang
TL;DR
This work tackles out-of-distribution detection (OoDD) in open-world settings by addressing the modality gap in vision-language models (VLMs). It introduces Cross-Modal Alignment (CMA), a multi-modal fine-tuning (MMFT) objective that aligns image and text embeddings on a hypersphere and leverages pretrained textual knowledge through NegLabel-based scoring, and it links the objective to energy-based models. On the ImageNet-1k-based MOS and OpenOOD v1.5 benchmarks, the method achieves state-of-the-art OoDD performance and competitive in-distribution (ID) accuracy, with CMA reducing the modality gap and improving cross-modal alignment. The findings highlight the value of fully exploiting textual priors in VLMs for robust OoDD and provide a principled, hyperspherical perspective on multi-modal representation learning.
Abstract
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods that use such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically either freeze the pretrained weights or tune them only partially, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Although some recent works have demonstrated the impact of fine-tuning methods for OoDD, significant room for improvement remains. We investigate the limitations of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment makes better use of pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. On ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches that leverage pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
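To make two quantities mentioned above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the modality gap is measured as the distance between the centroids of L2-normalized image and text embeddings (one common definition; the paper may use another), and it assumes a simplified NegLabel-style score, namely the share of softmax mass that an image assigns to ID labels versus negative labels. The function names, the `temperature` default, and the toy inputs are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project embeddings onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def modality_gap(image_emb, text_emb):
    """Distance between centroids of normalized image and text embeddings.

    Assumption: a centroid-based definition of the gap; other definitions exist.
    """
    img_centroid = l2_normalize(image_emb).mean(axis=0)
    txt_centroid = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_centroid - txt_centroid))


def neglabel_style_score(image_emb, id_text_emb, neg_text_emb, temperature=0.01):
    """Fraction of softmax mass on ID labels; higher means more ID-like.

    Assumption: a simplified form of negative-label scoring, without any
    grouping of negative labels.
    """
    img = l2_normalize(image_emb)
    labels = l2_normalize(np.concatenate([id_text_emb, neg_text_emb], axis=0))
    logits = img @ labels.T / temperature        # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, : id_text_emb.shape[0]].sum(axis=1)


# Toy 2-D example: one ID label along the x-axis, one negative label along y.
id_text = np.array([[1.0, 0.0]])
neg_text = np.array([[0.0, 1.0]])
images = np.array([[2.0, 0.0], [0.0, 3.0]])  # first is ID-like, second OoD-like
scores = neglabel_style_score(images, id_text, neg_text)
gap = modality_gap(images, id_text)
```

In this toy setup the first image receives a score close to 1 and the second a score close to 0, so thresholding the score separates them. Under these assumptions, a smaller gap means image and text embeddings occupy closer regions of the hypersphere, which is the condition the proposed objective aims to encourage for ID data.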
