Generalized Contrastive Learning for Universal Multimodal Retrieval
Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
TL;DR
This work tackles the modality gap in universal multimodal retrieval by introducing Generalized Contrastive Learning (GCL), a loss that jointly contrasts image ($i$), text ($t$), and fused ($it$) embeddings within each mini-batch to learn a unified representation space. GCL leverages off-the-shelf image-caption data and defines a six-way modality pairing with a normalization over $6N$, enabling retrieval across all modality combinations without expensive triplet datasets. Empirical results show consistent improvements across diverse backbones (VISTA, CLIP, TinyCLIP) and benchmarks (M-BEIR, MMEB, CoVR), including gains in zero-shot settings and for lightweight models. The approach reduces reliance on curated datasets while delivering robust performance across image-only, text-only, and fused retrieval tasks, with potential for future integration into multimodal large language models and information-driven generation.
Abstract
Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of a fused image-text modality (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has recently been explored to develop a single unified retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving a pair of image and text given a query image). However, such an approach requires careful curation to ensure dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.
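To make the idea of contrasting all modalities within a mini-batch concrete, the following is a minimal sketch of one plausible reading of such a loss, not the authors' exact formulation. It assumes each of the $N$ samples already has an image, a text, and a fused embedding, takes the six ordered pairs of distinct modalities as (query, positive key) combinations, and averages an InfoNCE-style term over the resulting $6N$ anchors. The function name `gcl_loss_sketch`, the temperature of 0.07, and the choice to draw negatives from every embedding of the other samples are all assumptions made for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gcl_loss_sketch(img, txt, fused, temperature=0.07):
    """Average an InfoNCE-style term over 6 ordered modality pairs x N samples.

    img, txt, fused: lists of N embedding vectors (lists of floats), where
    index k in each list refers to the same underlying sample.
    """
    mods = [[_normalize(v) for v in m] for m in (img, txt, fused)]
    n = len(mods[0])
    total = 0.0
    for a in range(3):          # query modality
        for b in range(3):      # positive-key modality
            if a == b:
                continue
            for k in range(n):
                query = mods[a][k]
                pos = _dot(query, mods[b][k]) / temperature
                # Assumption: candidates are the positive plus every
                # embedding (any modality) that belongs to another sample.
                logits = [pos]
                for m in mods:
                    for j in range(n):
                        if j != k:
                            logits.append(_dot(query, m[j]) / temperature)
                # Numerically stable log-sum-exp over the candidates.
                peak = max(logits)
                log_denominator = peak + math.log(
                    sum(math.exp(z - peak) for z in logits)
                )
                total += log_denominator - pos
    # Normalize by the number of (modality pair, sample) anchors: 6N.
    return total / (6 * n)
```

In training, the same computation would normally be written with a tensor library that supports automatic differentiation; plain Python is used here only to keep the sketch self-contained.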
