Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation

Yimu Wang, Evelien Riddell, Adrian Chow, Sean Sedwards, Krzysztof Czarnecki

TL;DR

SUPREME is a novel few-shot tuning framework comprising biased prompts generation and image-text consistency modules. It consistently outperforms existing VLM-based OOD detection methods and introduces a novel OOD score, $S_{\textit{GMP}}$, that leverages uni- and cross-modal similarities.

Abstract

Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score $S_{\textit{GMP}}$, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
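The prototype idea in the abstract can be illustrated with a minimal sketch. It assumes embeddings are plain Python lists, that image prototypes are the per-class mean of L2-normalized few-shot image embeddings, and that the text-side score is the maximum-softmax MCM score over cosine similarities. The function `multimodal_score` and its equal-weight average are hypothetical and are not the paper's exact $S_{\textit{GMP}}$, whose definition is not reproduced here.

```python
import math

def normalize(v):
    """Return v scaled to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def image_prototypes(embeddings_by_class):
    """Average each class's ID image embeddings into one prototype.
    Normalizing before and after averaging is an assumption of this sketch."""
    protos = {}
    for c, embs in embeddings_by_class.items():
        normed = [normalize(e) for e in embs]
        dim = len(normed[0])
        mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
        protos[c] = normalize(mean)
    return protos

def max_softmax_score(query, prototypes, temperature=1.0):
    """MCM-style score: maximum softmax probability over the
    temperature-scaled cosine similarities to the given prototypes."""
    sims = [cosine(query, p) / temperature for p in prototypes]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return max(exps) / sum(exps)

def multimodal_score(query, text_protos, image_protos, temperature=1.0):
    """Hypothetical combination (not the paper's S_GMP): the average of the
    text-prototype score and the image-prototype score."""
    return 0.5 * (max_softmax_score(query, text_protos, temperature)
                  + max_softmax_score(query, image_protos, temperature))

# Toy data: two classes in a 3-dimensional embedding space.
text_protos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
few_shot = {
    0: [[0.9, 0.1, 0.4], [0.8, 0.0, 0.5]],
    1: [[0.1, 0.9, 0.4], [0.0, 0.8, 0.5]],
}
img_protos = list(image_prototypes(few_shot).values())

id_query = [0.85, 0.05, 0.45]   # resembles class 0
ood_query = [0.5, 0.5, 0.0]     # equally close to both classes

id_score = multimodal_score(id_query, text_protos, img_protos, temperature=0.1)
ood_score = multimodal_score(ood_query, text_protos, img_protos, temperature=0.1)
```

In this toy setting the ID query scores higher than the OOD query, and a sample is flagged as OOD when its score falls below a chosen threshold. The temperature of 0.1 is an illustrative value, not one taken from the paper.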

Paper Structure

This paper contains 21 sections, 1 theorem, 18 equations, 7 figures, 8 tables.

Key Result

Theorem 1

Assuming that the OOD data is not drawn from any ID distribution, we have, where $I_{\textit{ID}}$ and $I_{\textit{OOD}}$ are the image embeddings of ID and OOD samples. We omit multi-modal prototypes for clarity.

Figures (7)

  • Figure 1: Standard VLM-based OOD detection methods [ming2022delving; miyai_locoop_2023; li_learning_2024] only utilize ID text prototypes ($\Diamond$) for identifying OOD samples. In comparison, SUPREME employs ID image prototypes ($\square$) to complement ID text prototypes, reducing the impact of the image-text modality gap [liang2022mind] and sharpening the boundary between ID and OOD data. "Img." and "Emb." represent image and embedding.
  • Figure 2: Overview of SUPREME. The two novel modules, i.e., BPG (biased prompts generation) and ITC (image-text consistency), are designed to minimize the modality gap. 1) During the few-shot fine-tuning stage, BPG generates image domain-biased prompts, conditioned on the estimated image domain bias and the mapped image embedding $m(I)$, for better image-text fusion. ITC minimizes the modality gap directly via the intra- and inter-modal losses ($\ell_{\textit{intra}}$ and $\ell_{\textit{inter}}$) with the text-to-image mapping $f_{t2i}(\cdot)$ and the image-to-text mapping $f_{i2t}(\cdot)$. 2) During inference, image prototypes are obtained by averaging each class's base ID image embeddings (\ref{eq: image anchor}). The proposed $S_{\textit{GMP}}$ (\ref{eq: MMO}) is calculated based on the maximum similarity between the multi-modal embeddings (the image embedding $I_t$ and mapped image embedding $I_t^{'}$) and ID multi-modal prototypes (i.e., text prototypes $\{P_{t,c}\}_{c\in[C]}$ and image prototypes $\{P_{i,c}\}_{c\in[C]}$). $S_{\textit{MCM}}$ refers to the MCM score [ming2022delving].
  • Figure 3: Performance comparison on the length of prompts.
  • Figure 5: Comparison between $S_{\textit{MCM}}$ [ming2022delving] and our $S_{\textit{GMP}}$ on ImageNet-1k (ID) to iNaturalist (OOD). The scores are multiplied by $100$ for better illustration. "KS" is the statistic from the Kolmogorov–Smirnov test. Higher KS statistic values indicate a greater difference between two distributions. Best viewed in color.
  • Figure 6: The impact of different sizes of fine-tuning data.
  • ...and 2 more figures
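
The KS statistic mentioned in the Figure 5 caption can be computed as follows. This is a minimal sketch of the standard two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical CDFs); the function name `ks_statistic` and the toy score lists are illustrative and do not come from the paper.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between
    the empirical CDFs of sample_a and sample_b."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Toy OOD scores: higher values are meant to indicate ID samples.
id_scores = [0.3, 0.4, 0.5, 0.6]
ood_scores = [0.1, 0.2, 0.3, 0.4]
separation = ks_statistic(id_scores, ood_scores)
```

A value near 1 means the ID and OOD score distributions barely overlap, while a value near 0 means they are nearly indistinguishable. In practice, `scipy.stats.ks_2samp` returns the same statistic together with a p-value.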

Theorems & Definitions (5)

  • Theorem 1: Multi-modal Prototypes Increase Score Separation between ID and OOD Data
  • Remark
  • Remark
  • Remark
  • Remark