Deep Reversible Consistency Learning for Cross-modal Retrieval

Ruitao Pu, Yang Qin, Dezhong Peng, Xiaomin Song, Huiming Zheng

TL;DR

This work tackles cross-modal retrieval under modality heterogeneity without requiring paired multimodal samples or randomly initialized priors. It introduces DRCL, a two-module framework: Selective Prior Learning (SPL) learns a transformation matrix on each modality and selects the highest-scoring one as the prior $W$, and Reversible Semantic Consistency learning (RSC) recasts modality-invariant representations from labels via the generalized inverse $W^{-1}$ (the MRR mechanism), aided by a feature augmentation mechanism (FA, embedding-space Mixup) and the losses ${\mathcal J}_L$, ${\mathcal J}_D$, and ${\mathcal J}_{MSE}$. Experiments on five datasets against 15 baselines report state-of-the-art MAP, and ablations show that SPL, MRR, and FA each contribute. Because each modality is trained independently, the approach extends to multiple modalities and improves semantic alignment between sample representations and labels.

Abstract

Cross-modal retrieval (CMR) typically involves learning common representations to directly measure similarities between multimodal samples. Most existing CMR methods assume that multimodal samples come in pairs and employ joint training to learn common representations, which limits the flexibility of CMR. Although some methods adopt independent training strategies for each modality to improve flexibility, they use randomly initialized orthogonal matrices to guide representation learning. This is suboptimal because it assumes that inter-class samples are independent of each other, limiting the potential of semantic alignment between sample representations and ground-truth labels. To address these issues, we propose a novel method termed Deep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL includes two core modules, i.e., Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). More specifically, SPL first learns a transformation weight matrix on each modality and selects the best one, based on a quality score, as the prior, which avoids blindly selecting priors learned from low-quality modalities. Then, RSC employs a Modality-invariant Representation Recasting mechanism (MRR) to recast the potential modality-invariant representations from sample semantic labels using the generalized inverse matrix of the prior. Since labels are devoid of modality-specific information, we use the recast features to guide representation learning, thus maintaining semantic consistency to the fullest extent possible. In addition, a feature augmentation mechanism (FA) is introduced in RSC to encourage the model to learn over a wider data distribution for diversity. Finally, extensive experiments on five widely used datasets and comparisons with 15 state-of-the-art baselines demonstrate the effectiveness and superiority of DRCL.
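
The recasting and augmentation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the prior $W$ is a $d \times c$ matrix mapping $d$-dimensional representations to $c$ class scores, so that a label row vector $y$ is recast as $yW^{-1}$ with $W^{-1}$ taken as the Moore-Penrose pseudo-inverse. The function names, the shapes, and the Beta-distributed mixing coefficient (`mix_alpha`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def recast_from_labels(labels_onehot, W):
    """Recast label vectors into the common space (assumed form of MRR).

    labels_onehot: (n, c) label matrix.
    W: (d, c) prior matrix (shape is an assumption of this sketch).
    Returns an (n, d) array of recast representations y @ pinv(W).
    """
    W_pinv = np.linalg.pinv(W)  # generalized inverse, shape (c, d)
    return labels_onehot @ W_pinv


def mixup_features(features, labels_onehot, mix_alpha=1.0, rng=None):
    """Embedding-space Mixup: convex combinations of shuffled sample pairs.

    mix_alpha is a hypothetical Beta-distribution parameter; it is unrelated
    to the hyperparameters alpha and beta studied in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(mix_alpha, mix_alpha)   # mixing coefficient in [0, 1]
    perm = rng.permutation(len(features))  # partner sample for each row
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y
```

If $W$ has full column rank, mapping the recast representations back through $W$ recovers the labels, which is the sense in which the recasting is reversible in this sketch.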

Paper Structure (20 sections, 9 equations, 3 figures, 6 tables, 1 algorithm)

Figures (3)

  • Figure 1: The overall framework of DRCL includes two modules: Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). (a) In SPL, we first optimize the transformation weight matrices on each modality and select the best one based on the quality scores $\{\mathcal{S}_k\}^K_{k=1}$ as the prior to guide the subsequent RSC. (b) In RSC, we introduce a feature augmentation mechanism (FA) to encourage the model to learn over a wider data distribution. Then, we utilize the Modality-invariant Representation Recasting mechanism (MRR) to recast the modality-invariant representation ($\tilde{\mathbf{y}}^i_k\mathbf{W}^{-1}$) for each semantic category by the prior matrix $\mathbf{W}$ for the subsequent learning. Lastly, we employ $\mathcal{J}_L$ for semantic consistency in the label space, $\mathcal{J}_{D}$ to enhance intra-class compactness and inter-class discriminability, and $\mathcal{J}_{MSE}$ for semantic invariance in the common subspace. (c) is the overall training pipeline of DRCL, including one-time SPL and $K$ times RSC learning modality-specific encoders ($\mathcal{F}=\{F_k\}^K_{k=1}$) for all modalities.
  • Figure 2: Precision-recall curves on the Wikipedia and NUS-WIDE datasets. See the supplementary material for more results.
  • Figure 3: (a-b) Comparison of MAP@50 and MAP@all scores on the XMedia dataset for different values of the hyperparameters $\alpha$ and $\beta$. (c) Comparison of MAP@50 and MAP@all scores on the XMedia dataset with different priors.
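
The training pipeline summarized in the Figure 1 caption (one-time SPL, then $K$ independent RSC runs) can be sketched as plain control flow. This is a hedged outline only: `learn_prior`, `score_prior`, and `train_encoder` are hypothetical callables standing in for steps the summary does not specify, and the sketch assumes that a higher quality score means a better prior.

```python
def run_pipeline(modalities, learn_prior, score_prior, train_encoder):
    """Control-flow sketch of DRCL training (all callables are hypothetical).

    learn_prior(m)      -> candidate prior matrix learned on modality m
    score_prior(W, m)   -> quality score S_k of that candidate
    train_encoder(m, W) -> modality-specific encoder F_k trained under prior W
    """
    # SPL, run once: learn one candidate prior per modality and score it.
    candidates = [learn_prior(m) for m in modalities]
    scores = [score_prior(W, m) for W, m in zip(candidates, modalities)]

    # Select the best-scoring candidate (assumption: higher is better).
    best = max(range(len(scores)), key=scores.__getitem__)
    prior = candidates[best]

    # RSC, run K times: each modality is trained independently, all of them
    # guided by the same selected prior.
    encoders = [train_encoder(m, prior) for m in modalities]
    return best, prior, encoders
```

Because every encoder depends only on its own modality and the shared prior, this structure allows a new modality to be trained later without retraining the existing encoders.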