Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation

Yang Yang, Wenjuan Xi, Luping Zhou, Jinhui Tang

TL;DR

This work addresses modal imbalance in vision-language retrieval by shifting from exact instance-level cross-modal matching to structure-preserving learning. It introduces a cross-modal student model guided by two modal-independent teachers and a multi-granularity distillation scheme that combines representation-level and structure-aware losses. The method computes inter- and intra-modal relational matrices and uses a learnable fusion of teacher relations, with MAE-based distillation enforcing geometric consistency in the latent space. Empirical results on MS-COCO, Flickr30K, and VizWiz show improvements in cross-modal, single-modal, and mixed retrieval, along with strong ablations and generalization to other architectures. The approach offers a plug-and-play module that enhances structure-preserving cross-modal learning, with clear implications for robust, balanced multi-modal retrieval systems.

Abstract

Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently affected by imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching scheme that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between the learned matching representations and the intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, which also enhances single-modal retrieval capabilities compared to the baseline models.
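
To make the structure-aware distillation described above more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`relation_matrix`, `fuse`, `structure_distill_loss`), the use of cosine similarity for the relational matrices, and the sigmoid-weighted blend of the two teacher relations are all assumptions made for illustration. In an actual training setup the fusion weight `logit` would be a learnable parameter inside an autodiff framework; here it is a fixed number.

```python
import math

def normalize(v):
    # Scale a vector to unit length (a zero vector is returned unchanged).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def relation_matrix(a, b):
    # Pairwise cosine similarities between the rows of a and the rows of b.
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    return [[sum(p * q for p, q in zip(u, v)) for v in b_n] for u in a_n]

def fuse(r_img, r_txt, logit):
    # Assumed fusion: convex combination of the two teacher relations,
    # with the mixing weight obtained from a sigmoid over a scalar logit.
    alpha = 1.0 / (1.0 + math.exp(-logit))
    return [[alpha * p + (1.0 - alpha) * q for p, q in zip(ri, rt)]
            for ri, rt in zip(r_img, r_txt)]

def mae(r, target):
    # Mean absolute error between two matrices of the same shape.
    diffs = [abs(p - q) for rr, tr in zip(r, target) for p, q in zip(rr, tr)]
    return sum(diffs) / len(diffs)

def structure_distill_loss(s_img, s_txt, t_img, t_txt, logit=0.0):
    # Teachers: intra-modal relations from the image-only and text-only models.
    target = fuse(relation_matrix(t_img, t_img),
                  relation_matrix(t_txt, t_txt), logit)
    # Student: inter-modal (image-to-text) relations in the common space.
    inter = relation_matrix(s_img, s_txt)
    return mae(inter, target)

# Toy batch of three 2-d embeddings (hypothetical values).
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss_same = structure_distill_loss(batch, batch, batch, batch)
```

When the student embeddings reproduce the teachers' geometry exactly, as in the toy call above, the loss vanishes; any distortion of the pairwise structure in the common space makes it positive.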

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The impact of imbalanced modalities on single-modal retrieval in cross-modal learning. Panels (a) and (b) present the NDCG@{10,20,50} results of the single-modal encoders after training two cross-modal models, SCAN (Lee et al., 2018) and BLIP (Li et al., 2022), on the MS-COCO (1K) dataset. The asterisk (*) indicates that the large models were retrained from scratch. "Best" denotes the results obtained by the best single-modal models trained separately on images and texts. The cross-modal model and the single-modal models adopt identical network architectures.
  • Figure 2: t-SNE visualization of Best T2T, BLIP* T2T, and ViT2BERT T2T on the Flickr30K dataset, where BLIP* T2T uses ViT/B and BERT as backbones. We randomly choose a text query and a database with 150 samples. The blue pentagram represents the text query, while the top-5 retrieval candidates are shown as triangles. Ground-truth candidates are marked in green, and non-ground-truth candidates are marked in red.
  • Figure 3: Illustration of our framework. Expanding on the framework of cross-modal matching, we incorporate a single-modal teacher network. Our multi-granularity distillation includes representation-level distillation and structure-aware distillation, where the former optimizes the expressive capabilities of individual modalities via contrastive loss, and the latter enhances instance-level matching through structure-aware distillation.
  • Figure 4: The results of I2IT@10 (a-d) and T2IT@10 (e-h) on the mixed retrieval task. The method marked with a "+" sign, i.e., X-VLM*+, is our method.
  • Figure 5: Parameter analyses. We examine the influence of the parameters of our method on the Flickr30K and VizWiz datasets.
  • ...and 2 more figures
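
Figure 1 reports NDCG@{10,20,50}. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k in plain Python. It assumes the linear-gain variant with a log2 rank discount and assumes that `relevances` holds the graded relevance labels of the full ranked list for one query; the paper's exact variant is not stated in this section, and the function names are illustrative only.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k positions (ranks start at 1,
    # so the discount for the item at index i is log2(i + 2)).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal ordering of the same labels.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: the only relevant item is retrieved at rank 2.
score = ndcg_at_k([0, 1, 0], k=10)
```

A perfectly ordered ranking scores 1.0, and pushing relevant items further down the list lowers the score logarithmically with rank.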