
Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning

Zhijie Nie, Richong Zhang, Zhangchi Feng, Hailang Huang, Xudong Liu

TL;DR

This work proposes a simple but effective 1-to-K contrastive learning method that treats each language equally and eliminates error propagation and optimization bias, and introduces a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance.

Abstract

Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search that aims to break the barriers between modality and language simultaneously, achieving image-text retrieval in the multilingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; in particular, methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow existing pre-training methods from the cross-lingual or cross-modal domain, leading to two inconsistency problems in CCR: methods in the cross-lingual style suffer from intra-modal error propagation, resulting in inconsistent recall performance across languages over the whole dataset; methods in the cross-modal style suffer from inter-modal optimization direction bias, resulting in inconsistent ranks across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method that treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-training data, achieving a new state of the art.
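
To make the 1-to-K idea concrete, the following is a minimal PyTorch sketch of what such an objective could look like, assuming each image in a batch is contrasted jointly against its K parallel captions (one per language) so that no language is privileged. The function name, tensor shapes, and soft-label formulation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def one_to_k_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Illustrative 1-to-K contrastive objective (not the authors' code).

    image_emb: (N, D)     one embedding per image
    text_emb:  (N, K, D)  K parallel captions (one per language) per image
    """
    N, K, _ = text_emb.shape
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1).reshape(N * K, -1)

    # Similarity of every image to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature      # (N, N*K)

    # Image-to-text: each image has K positives (its caption in every
    # language), weighted equally so no language is privileged.
    targets = torch.zeros_like(logits)
    for i in range(N):
        targets[i, i * K:(i + 1) * K] = 1.0 / K
    loss_i2t = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # Text-to-image: each caption has exactly one positive image.
    labels_t2i = torch.arange(N, device=logits.device).repeat_interleave(K)
    loss_t2i = F.cross_entropy(logits.t(), labels_t2i)

    return (loss_i2t + loss_t2i) / 2
```

Treating all K captions as equal positives inside a single softmax is what removes the English pivot and, per the abstract, the two sources of inconsistency along with it.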

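In the same spirit, here is a minimal sketch of a rank-variance-style metric matching the description of MRV, under the assumption that MRV averages, over instances, the variance of each instance's per-language retrieval ranks; consult the paper for the exact definition.

```python
import numpy as np

def mean_rank_variance(ranks):
    """Illustrative rank-variance metric (assumed reading of MRV).

    ranks: (N, K) array, where ranks[i, k] is the rank at which
    instance i's ground-truth item is retrieved when querying in
    language k.
    """
    ranks = np.asarray(ranks, dtype=float)
    # Variance of the K per-language ranks within each instance,
    # averaged over all N instances. A perfectly language-consistent
    # model scores 0 regardless of its absolute ranks.
    return ranks.var(axis=1).mean()

# Instance 0 is ranked 1 in all three languages (consistent);
# instance 1 is ranked very differently across languages (inconsistent).
print(mean_rank_variance([[1, 1, 1], [1, 5, 30]]))
```

Unlike Recall@K, which only counts how often the ground truth lands in the top K, a variance-based measure like this is sensitive to an instance being retrieved well in one language but poorly in another.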

Paper Structure

This paper contains 64 sections, 2 theorems, 9 equations, 6 figures, and 8 tables.

Key Result

Lemma 1

Suppose that $\theta$ is the angle between the practical and correct alignment directions of $\hat{t}_{n}$. Then $\theta$ converges to 0 if and only if English texts can be aligned well with images, i.e., $\alpha$ tends to 0.
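
As a purely numerical illustration (an assumed decomposition for demonstration, not the paper's construction), one can model the practical alignment direction as the correct direction plus an error component scaled by $\alpha$ and watch $\theta$ vanish as $\alpha \to 0$:

```python
import numpy as np

# Hypothetical decomposition: practical direction = correct direction
# plus an orthogonal error component whose magnitude is alpha.
correct = np.array([1.0, 0.0])
error = np.array([0.0, 1.0])

for alpha in [1.0, 0.5, 0.1, 0.01]:
    practical = correct + alpha * error
    cos_theta = practical @ correct / (
        np.linalg.norm(practical) * np.linalg.norm(correct)
    )
    theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    print(f"alpha={alpha:5.2f} -> theta={theta_deg:6.2f} deg")

# The printed angle shrinks toward 0 degrees: good English-image
# alignment (alpha -> 0) makes the two directions coincide.
```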

Figures (6)

  • Figure 1: Two inconsistency problems exist in current cross-lingual cross-modal pre-training methods, leading to inconsistent recall and inconsistent ranking in cross-lingual cross-modal retrieval, respectively.
  • Figure 2: Theoretical analysis and empirical observation of the inconsistency in Recall@K. (a) An illustration of Lemma 1, where the green arrow represents the correct alignment direction and the red arrow represents the practical alignment direction. (b) A comparison of InfoNCE loss values in different scenarios. We pre-trained and recorded the loss changes using SimCSE (Gao et al., 2021) in the uni-modal setting, ALBEF (Li et al., 2021) in the cross-modal setting, and CCLM (Zeng et al., 2022) in CCP, respectively, while keeping other settings as identical as possible.
  • Figure 3: Theoretical analysis and empirical observation of the inconsistency in rank. (a) An illustration of Lemma 2, where the green arrow represents the correct alignment direction and the red arrow represents the practical alignment direction. (b) A t-SNE visualization of 10 instances randomly sampled from xFlickr&CO. The representations are obtained by a Swin Transformer (Liu et al., 2021) and the first half (first six layers) of XLM-R (Conneau et al., 2020), following the setting in CCLM (Zeng et al., 2022).
  • Figure 4: The overview of our pre-training tasks, model architecture, and evaluation metrics.
  • Figure 5: Further Study of the Alignment Process.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2