
Generalized Contrastive Learning for Universal Multimodal Retrieval

Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

TL;DR

This work tackles the modality gap in universal multimodal retrieval by introducing Generalized Contrastive Learning (GCL), a loss that jointly contrasts image ($i$), text ($t$), and fused ($it$) embeddings within each mini-batch to learn a unified representation space. GCL leverages off-the-shelf image-caption data and defines a six-way modality pairing with a normalization over $6N$, enabling retrieval across all modality combinations without expensive triplet datasets. Empirical results show consistent improvements across diverse backbones (VISTA, CLIP, TinyCLIP) and benchmarks (M-BEIR, MMEB, CoVR), including gains in zero-shot settings and for lightweight models. The approach reduces reliance on curated datasets while delivering robust performance across image-only, text-only, and fused retrieval tasks, with potential for future integration into multimodal large language models and information-driven generation.

Abstract

Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of fused image-text modalities (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has recently been explored to develop a single unified retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving a pair of image and text given a query image). However, such an approach requires careful curation to ensure dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.
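
The idea of contrasting all modalities within one mini-batch can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gcl_loss_sketch`, the temperature default of 0.07, and the choice of negatives (every pooled embedding that belongs to a different sample) are assumptions made for illustration. What follows the summary above is the pairing scheme: each of the $N$ samples contributes an image, a text, and a fused embedding; positives are same-sample embeddings of a different modality, which gives six ordered modality pairings, and the loss is averaged over the resulting $6N$ terms. Plain Python lists are used so the example has no dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gcl_loss_sketch(e_i, e_t, e_it, temperature=0.07):
    """InfoNCE-style loss averaged over the 6N cross-modal (anchor, positive) pairs.

    e_i, e_t, e_it: lists of N image, text, and fused embeddings (same order).
    temperature: hypothetical default, not taken from the paper.
    """
    # Pool all 3N embeddings into one candidate set: (modality, sample, vector).
    pool = [
        (modality, sample, l2_normalize(vec))
        for modality, embs in enumerate((e_i, e_t, e_it))
        for sample, vec in enumerate(embs)
    ]
    total, terms = 0.0, 0
    for q_mod, q_idx, q in pool:
        # ASSUMPTION: negatives are all pooled embeddings of *other* samples,
        # regardless of modality.
        neg_logits = [
            dot(q, c) / temperature for _, c_idx, c in pool if c_idx != q_idx
        ]
        for p_mod, p_idx, p in pool:
            # Positives: same sample, different modality. Same-modality
            # positives (the anchor itself) are masked out.
            if p_idx != q_idx or p_mod == q_mod:
                continue
            pos_logit = dot(q, p) / temperature
            logits = [pos_logit] + neg_logits
            peak = max(logits)  # subtract the max for a stable log-sum-exp
            log_denominator = peak + math.log(
                sum(math.exp(x - peak) for x in logits)
            )
            total += log_denominator - pos_logit
            terms += 1
    # Each of the 3N anchors has 2 positives, so terms == 6 * N.
    return total / terms


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(gcl_loss_sketch(aligned, aligned, aligned))  # modalities agree: near zero
print(gcl_loss_sketch(aligned, swapped, aligned))  # text misaligned: much larger
```

In this toy usage the loss is close to zero when the three modalities of each sample coincide and grows sharply when the text embeddings are swapped between samples, which is the behavior a contrastive objective of this kind is expected to show.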

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of GCL. Given an embedding model pretrained for cross-modal alignment, previous studies (e.g., VISTA) constructed new triplet datasets to simulate specific multimodal retrieval scenarios. However, this approach limits generalization to unseen retrieval scenarios (white squares). In contrast, GCL improves retrieval performance across diverse scenarios (black squares). Specifically, by utilizing off-the-shelf image-caption datasets, GCL enables the learning of retrieval tasks involving nine different modality combinations.
  • Figure 2: PCA visualization of representation spaces using $e_i$, $e_t$, and $e_{it}$. We use MSCOCO for $e_i$ (red) and $e_t$ (blue) and WebQA for $e_{it}$ (green). We sampled 2K samples from each modality, using 6K samples in total. $\overline{e}$ indicates the average embedding vector of each modality.
  • Figure 3: Training process of GCL. Given a dataset composed of image-caption pairs, we extract $e_{i}$, $e_{t}$, and $e_{it}$. For $e_{it}$, we follow the extraction method used by the retrieval model (e.g., VISTA and CLIP-SF). Then, we integrate samples of the three different modalities into a single mini-batch for contrastive learning. We mask out the supervision on the positive samples with identical modalities.
  • Figure 4: Rankings of ground truth candidates. The x-axis and y-axis indicate the ranks and the frequency of ranks, respectively. We use the task of $q_t$→$c_{i}$ on the MSCOCO dataset, with a candidate pool composed of $c_{i}$ and $c_{t}$ from MSCOCO.
  • Figure 5: (a) Cosine similarity between queries and ground truth candidates. The x-axis and y-axis indicate the dataset and cosine similarity, respectively. VisualN. refers to VisualNews. (b) Cosine similarity between queries and top-ranked candidates. We use MSCOCO for the task of $q_i$→$c_{t}$.
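
The quantities behind Figures 2, 4, and 5 (a per-modality average embedding, the rank of the ground truth candidate in a mixed-modality pool, and query-candidate cosine similarity) can be sketched in a few lines. This is a toy illustration, not the paper's evaluation code: the helper names `cosine`, `ground_truth_rank`, and `modality_mean` are hypothetical, and the two-dimensional vectors are made-up values chosen so that a same-modality distractor outranks the ground truth image.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def ground_truth_rank(query, candidates, gt_index):
    """1-based rank of candidates[gt_index] when sorted by similarity to query."""
    sims = [cosine(query, c) for c in candidates]
    return 1 + sum(1 for s in sims if s > sims[gt_index])


def modality_mean(embeddings):
    """Average embedding vector of one modality (the mean marker in a PCA plot)."""
    count = len(embeddings)
    return [sum(e[d] for e in embeddings) / count for d in range(len(embeddings[0]))]


# Toy text query and a candidate pool mixing image and text candidates
# (all values are invented for illustration).
q_t = [1.0, 0.2]
c_i = [[0.9, 0.3], [0.1, 1.0]]   # image candidates; index 0 is the ground truth
c_t = [[1.0, 0.25], [0.0, 1.0]]  # text candidates acting as distractors
pool = c_i + c_t

print(ground_truth_rank(q_t, pool, gt_index=0))
print(cosine(q_t, pool[0]))
print(cosine(modality_mean(c_i), modality_mean(c_t)))
```

With these toy numbers the first text candidate is slightly closer to the text query than the ground truth image is, so the ground truth does not rank first. A histogram of such ranks over many queries corresponds to the kind of plot described for Figure 4.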