
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Simon Schrodi, David T. Hoffmann, Max Argus, Volker Fischer, Thomas Brox

TL;DR

The paper dissects two perplexing properties of contrastive vision-language models—the modality gap and object bias—through large-scale empirical analysis and controlled synthetic experiments. It introduces MOAD and BRACE-inspired perspectives and demonstrates that information imbalance between images and captions is the root cause, driving both phenomena and affecting logit entropy. Crucially, it shows that removing or reducing the information imbalance decreases both the gap and object bias and can improve downstream performance, while post-hoc gap closing alone does not guarantee gains. The work reframes the modality gap as a feature that affords entropy control and provides practical guidance for data enrichment and filtering to mitigate bias and improve cross-modal alignment.

Abstract

Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluated off-the-shelf VLMs and, while the gap's influence on performance is typically overshadowed by other factors, we find indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are differently organized. To allow for a clean study of object bias, we introduce a definition and a corresponding measure of it. Equipped with this tool, we find that object bias does not, per se, lead to worse performance on other concepts, such as attributes. However, why do both phenomena, modality gap and object bias, emerge in the first place? To answer this fundamental question and uncover some of the inner workings of contrastive VLMs, we conducted experiments that allowed us to control the amount of shared information between the modalities. These experiments revealed that the driving factor behind both the modality gap and the object bias is an information imbalance between images and captions, and unveiled an intriguing connection between the modality gap and the entropy of the logits.

Paper Structure

This paper contains 35 sections, 12 equations, 16 figures, and 10 tables.

Figures (16)

  • Figure 1: Illustration of information imbalance between images (top left) and captions (bottom left). This imbalance makes it virtually impossible, even for an oracle image encoder, to predict the content of a caption, leading to undesirable effects in contrastive training, such as the modality gap and object bias (see \ref{sec:miniCLIP}).
  • Figure 2: Examples from MAD.
  • Figure 3: Relation between modality gap (L2M & RMG, larger value $\rightarrow$ larger gap) and downstream performance for a total of 98 contrastive VLMs pre-trained on medium- and large-scale datasets (each scatter point is a VLM). The plots indicate no to weak positive correlations between performance and modality gap (see the numbers in \ref{tab:performance_correlations}).
  • Figure 4: Few embedding dimensions separate the modalities. Results on MS-COCO. (\ref{fig:mean_difference}) We plot the absolute difference in the means of each embedding dimension between the modalities. Most dimensions have similar means for both modalities, but for some the differences are huge. (\ref{fig:separability}) Pairs of these high-difference dimensions can perfectly separate the modalities (we show the ones with the largest mean for each modality). (\ref{fig:ablated_dims}) Successive removal of embedding dimensions, following the sorting from (\ref{fig:mean_difference}), leads to a sharp drop, followed by a partial recovery of downstream performance, while the modality gap gradually closes (similar results for L2M). See \ref{sub:few_embeds_appendix} for results on ImageNet and for the plots in (\ref{fig:separability}) with the two largest dimensions of (\ref{fig:mean_difference}).
  • Figure 5: Object bias and performance on attribute tasks. (\ref{fig:object_bias_vs_performance}) We find a bias towards objects (positive MOAD values) but no correlation with attribute performance. We attribute this to the (\ref{fig:obj_vs_attr_perf}) positive correlation between performance improvements on object tasks and attribute tasks.
  • ...and 11 more figures
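
The per-dimension analysis described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes that L2M denotes the Euclidean distance between the two modality means, uses synthetic embeddings in place of real VLM outputs, and all function names are hypothetical. It only measures how the gap changes when high-difference dimensions are removed; it does not evaluate downstream performance.

```python
import numpy as np


def per_dimension_mean_gap(image_emb, text_emb):
    """Absolute difference of per-dimension means between the two modalities."""
    return np.abs(image_emb.mean(axis=0) - text_emb.mean(axis=0))


def l2_between_means(image_emb, text_emb):
    """Euclidean distance between modality means (assumed reading of L2M)."""
    return float(np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0)))


def ablate_top_dimensions(image_emb, text_emb, k):
    """Drop the k dimensions with the largest mean difference, then re-normalize."""
    gap = per_dimension_mean_gap(image_emb, text_emb)
    # argsort is ascending, so the first (d - k) indices are the ones we keep.
    keep = np.sort(np.argsort(gap)[: gap.size - k])

    def renorm(x):
        x = x[:, keep]
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    return renorm(image_emb), renorm(text_emb)


def make_toy_embeddings(n=512, d=64, offset=3.0, seed=0):
    """Synthetic stand-in: each modality is shifted along one dedicated dimension."""
    rng = np.random.default_rng(seed)
    img = rng.normal(size=(n, d))
    txt = rng.normal(size=(n, d))
    img[:, 0] += offset  # hypothetical "image" dimension
    txt[:, 1] += offset  # hypothetical "text" dimension
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    return img, txt


if __name__ == "__main__":
    img, txt = make_toy_embeddings()
    gap = per_dimension_mean_gap(img, txt)
    print("dimensions with largest mean difference:", np.argsort(gap)[-2:])
    print("L2 between means before ablation:", l2_between_means(img, txt))
    img2, txt2 = ablate_top_dimensions(img, txt, k=2)
    print("L2 between means after ablation:", l2_between_means(img2, txt2))
```

In this toy setting the gap is concentrated in two dimensions by construction, so removing them closes most of it. With real embeddings, the same functions could be applied to paired image and caption features, for example from MS-COCO.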