Cross-Modal Coordination Across a Diverse Set of Input Modalities

Jorge Sánchez, Rodrigo Laguna

TL;DR

The paper tackles cross-modal retrieval across a broad set of input modalities with two formulations: PCMC, a multimodal extension of CLIP’s contrastive objective, and PCMR, a non-contrastive regression objective that pushes cross-modal similarities toward a task-driven target. Each modality is projected into a shared embedding space and trained on pairwise interactions between modalities. The authors show that combining multiple modalities improves retrieval performance and enables zero-shot classification as well as query and database enrichment strategies. On Flickr8k and CUB, the approach is competitive with specialized bimodal models, with notable gains when additional modalities and informative class embeddings are incorporated. The methods target settings where diverse sensor inputs and descriptions must be aligned for retrieval and classification.
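The idea of combining embeddings from several modalities to enrich a query can be illustrated with a minimal sketch. It assumes that all modalities already live in the shared space and that enrichment means averaging the normalized embeddings and renormalizing; the summary does not state the paper's exact combination rule, so the averaging step and the names `combine_embeddings` and `retrieve` are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def combine_embeddings(embeddings):
    """Average embeddings of the same items from several modalities.

    `embeddings` is a list of (N, D) arrays, one per modality, where row i
    of every array describes the same item. Averaging is an assumption made
    for this sketch, not necessarily the rule used in the paper.
    """
    stacked = np.stack([l2_normalize(e) for e in embeddings], axis=0)  # (K, N, D)
    return l2_normalize(stacked.mean(axis=0))


def retrieve(query, database, k=1):
    """Return the indices of the top-k database rows by cosine similarity."""
    sims = l2_normalize(query) @ l2_normalize(database).T  # (Nq, Nd)
    return np.argsort(-sims, axis=1)[:, :k]
```

An enriched query would then be built as `combine_embeddings([text_emb, audio_emb])` and passed to `retrieve` against, say, image embeddings; the same function can enrich the database side instead.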

Abstract

Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one. Due to the wide range of practical applications, research on the problem has focused mainly on the vision and language case, e.g., text-to-image retrieval, where models like CLIP have proven effective. The dominant approach to learning such coordinated representations consists of projecting each modality onto a common space where matching views stay close and those from non-matching pairs are pushed away from each other. Although this cross-modal coordination has also been applied to other pairwise combinations, extending it to an arbitrary number of diverse modalities is a problem that has not been fully explored in the literature. In this paper, we propose two different approaches to the problem. The first is based on an extension of the CLIP contrastive objective to an arbitrary number of input modalities, while the second departs from the contrastive formulation and tackles the coordination problem by regressing the cross-modal similarities towards a target that reflects two simple and intuitive constraints of the cross-modal retrieval task. We run experiments on two different datasets, over different combinations of input modalities, and show that the approach is not only simple and effective but also allows for tackling the retrieval problem in novel ways. Besides capturing a more diverse set of pairwise interactions, we show that we can use the learned representations to improve retrieval performance by combining the embeddings from two or more such modalities.
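One plausible reading of "an extension of the CLIP contrastive objective to an arbitrary number of input modalities" is to apply the symmetric CLIP loss to every pair of modalities and average the results. The sketch below follows that reading; the pairwise averaging, the fixed temperature of 0.07, and the function names are assumptions, since the abstract does not give the exact formulation. The regression variant is not sketched because its target is not specified here.

```python
import itertools

import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_pair_loss(za, zb, temperature=0.07):
    """Symmetric contrastive loss between two (N, D) batches of unit vectors.

    Row i of `za` and row i of `zb` are views of the same sample, so the
    positives sit on the diagonal of the similarity matrix.
    """
    logits = za @ zb.T / temperature
    idx = np.arange(len(za))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def pairwise_contrastive_loss(views, temperature=0.07):
    """Average the symmetric loss over all unordered pairs of modalities.

    `views` is a list of M >= 2 arrays of shape (N, D), one per modality.
    """
    views = [v / np.linalg.norm(v, axis=1, keepdims=True) for v in views]
    pairs = list(itertools.combinations(range(len(views)), 2))
    total = sum(clip_pair_loss(views[i], views[j], temperature) for i, j in pairs)
    return total / len(pairs)
```

With M modalities this evaluates M(M-1)/2 pairwise terms per batch, which is where the "more diverse set of pairwise interactions" comes from; with M = 2 it reduces to the usual two-tower CLIP loss.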

Paper Structure (15 sections, 9 equations, 3 figures, 7 tables)

This paper contains 15 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Average cross-modal performance (avg. r@1) using $2,\dots,M$ modalities. Flickr8k: PCMC (C, red), PCMR (R, blue), frozen backbones. CUB: PCMC, fine-tuned backbones.
  • Figure 2: Qualitative cross-modal retrieval examples for enriched query (first two rows) and database vectors (last two rows). See text for details. Best viewed in color and with magnification.
  • Figure 3: t-SNE projections of the text (left), image (middle), and class embeddings (right) learned on the CUB dataset using PCMC.
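The "avg. r@1" metric reported in Figure 1 can be sketched as recall@1 averaged over modality pairs. The sketch assumes a one-to-one correspondence between rows (item i in one modality matches item i in every other) and averages over all ordered query/database pairs; both are simplifying assumptions, and the paper's evaluation protocol may differ (for example, Flickr8k provides several captions per image).

```python
import itertools

import numpy as np


def recall_at_k(queries, database, k=1):
    """Fraction of queries whose matching row appears among the top-k results.

    Assumes row i of `queries` matches row i of `database`.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ d.T                                  # cosine similarities (N, N)
    topk = np.argsort(-sims, axis=1)[:, :k]         # best k database rows per query
    targets = np.arange(len(q))[:, None]
    return float((topk == targets).any(axis=1).mean())


def average_cross_modal_recall(views, k=1):
    """Mean recall@k over all ordered (query, database) modality pairs."""
    scores = [
        recall_at_k(views[i], views[j], k)
        for i, j in itertools.permutations(range(len(views)), 2)
    ]
    return float(np.mean(scores))
```

Calling `average_cross_modal_recall([image_emb, text_emb, audio_emb])` on held-out embeddings would give a single number comparable in spirit to one point on the Figure 1 curves.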