
I0T: Embedding Standardization Method Towards Zero Modality Gap

Na Min An, Eunki Kim, James Thorne, Hyunjung Shim

TL;DR

The paper tackles the modality gap in CLIP-like vision-language models by introducing I0T, a two-stage framework that first enhances semantic representations and then reduces modality-specific discrepancies via post-hoc embedding standardization (I0T_post) or learnable per-encoder batch normalization (I0T_async). I0T_post can drive the gap to near zero and enables an automatic reference-free evaluation metric, I0T-S, while I0T_async provides a strong, practical alternative with competitive downstream performance. The work shows that removing modality-specific activation patterns via normalization can dramatically align image and text embeddings without destroying semantic content, offering a practical path to more reliable cross-modal retrieval and evaluation. These findings have potential impact on multimodal benchmarks and real-world systems requiring robust, explainable cross-modal similarity measures.

Abstract

Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the modality gap, which arises when the image and text embeddings are projected onto disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristics that each image/text encoder independently possesses and propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, $\text{I0T}_{\text{post}}$, that reduces the modality gap to approximately zero and (2) a trainable method, $\text{I0T}_{\text{async}}$, that alleviates the modality gap by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their locked parameters. In practice, $\text{I0T}_{\text{post}}$ can serve as an explainable automatic evaluation metric, an alternative to the widely used CLIPScore (CLIP-S).
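
To make the post-hoc idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that standardization means subtracting each modality's mean embedding and then renormalizing every row to unit length, and that the modality gap is measured as the Euclidean distance between the image and text centroids. The embeddings are synthetic, and the names `standardize_post_hoc`, `centroid_gap`, and `l2_normalize` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit Euclidean length.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def centroid_gap(img, txt):
    # Assumed gap measure: Euclidean distance between modality centroids.
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def standardize_post_hoc(emb):
    # Assumed standardization: remove the modality's mean vector,
    # then renormalize each embedding to unit length.
    centered = emb - emb.mean(axis=0, keepdims=True)
    return l2_normalize(centered)

# Synthetic paired embeddings: shared semantics plus a modality-specific offset.
rng = np.random.default_rng(0)
n, d = 256, 64
shared = rng.normal(size=(n, d))
img_offset = 3.0 * rng.normal(size=d)
txt_offset = 3.0 * rng.normal(size=d)
img = l2_normalize(shared + 0.1 * rng.normal(size=(n, d)) + img_offset)
txt = l2_normalize(shared + 0.1 * rng.normal(size=(n, d)) + txt_offset)

gap_before = centroid_gap(img, txt)
img_std = standardize_post_hoc(img)
txt_std = standardize_post_hoc(txt)
gap_after = centroid_gap(img_std, txt_std)

# Mean cosine similarity of matched image-text pairs, before and after.
paired_before = float(np.mean(np.sum(img * txt, axis=1)))
paired_after = float(np.mean(np.sum(img_std * txt_std, axis=1)))

print(f"centroid gap: {gap_before:.3f} -> {gap_after:.3f}")
print(f"paired cosine: {paired_before:.3f} -> {paired_after:.3f}")
```

In this toy setup the modality-specific offset dominates the raw embeddings, so removing each modality's mean is what exposes the shared component. Because the mean is computed from the embeddings themselves, the step needs no training and leaves the encoder parameters untouched, which matches the post-hoc framing in the abstract.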

Paper Structure

This paper contains 36 sections, 3 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: A more refined scoring system using our proposal (I0T-S) than CLIP-S. I0T-S assigns a higher similarity score to the correct image-text pair than to irrelevant pairs.
  • Figure 2: Linear separability and minimum cosine distance (dashed line) vs. centroid distance illustrated with corresponding 3D-projected embeddings. The embeddings are categorized by three modality gap severity levels: severe, moderate, and low.
  • Figure 3: Comparison of normalized embedding activations (avg: salmon, std: gray) and modality gap across three post-hoc methods applied on Long-CLIP.
  • Figure 4: Comparison of the non-CLIP-based model BLIP and ours in terms of efficiency and performance.
  • Figure 5: I0T-S yields a wider cosine similarity distribution with a mean close to 0 than CLIP-S and PAC-S without the scaling factor (i.e., $\omega=1$), contributing to more explainable similarity scores for positive and negative image-caption pairs.
  • ...and 3 more figures