Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval

Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang

TL;DR

This work tackles cross-lingual cross-modal retrieval (CCR) for low-resource target languages by introducing Dynamic Adapter with Semantics Disentangling (DASD). DASD freezes a pretrained vision-language backbone and generates input-conditioned adapters from two disentangled caption features, semantic-related and semantic-agnostic, with training guided by semantic consistency and adversarial losses. The method achieves state-of-the-art results on image-text and video-text CCR benchmarks under both fine-tuning and zero-shot settings, and is compatible with multiple VLP models. The approach offers a parameter-efficient, data-efficient path for extending multimodal retrieval to many languages, with practical relevance to multilingual search and accessibility.

Abstract

Existing cross-modal retrieval methods typically rely on large-scale vision-language pair data, which makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision with a low-resource language (the target language) without using any human-labeled target-language data, has therefore gained increasing attention. A common parameter-efficient solution is to use adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, making it difficult for them to adapt to target-language captions with varied expressions. To alleviate this, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions. Considering that the semantics and expression style of an input caption largely influence how it should be encoded, we propose a semantic disentangling module that extracts semantic-related and semantic-agnostic features from the input, ensuring that the generated adapters are well suited to the characteristics of the input caption. Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models.
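
The idea of an adapter whose parameters are generated from the input caption can be illustrated with a minimal sketch. This is not the authors' implementation: the bottleneck layout, the linear parameter generator, the layer sizes, and every name below (`init_generator`, `dynamic_adapter`, `f_sr`, `f_sa`) are assumptions made for illustration, and the random vectors stand in for features that the paper's disentangling module would produce.

```python
import numpy as np

HIDDEN = 16      # hidden size of the frozen text encoder (assumed)
BOTTLENECK = 4   # adapter bottleneck width (assumed)
FEAT = 8         # size of each disentangled feature vector (assumed)


def init_generator(rng):
    """Hypothetical parameter generator: one linear map per adapter matrix."""
    in_dim = 2 * FEAT  # semantic-related and semantic-agnostic features, concatenated
    return {
        "to_down": rng.normal(0.0, 0.1, (in_dim, HIDDEN * BOTTLENECK)),
        "to_up": rng.normal(0.0, 0.1, (in_dim, BOTTLENECK * HIDDEN)),
    }


def dynamic_adapter(hidden_states, f_sr, f_sa, gen):
    """Apply a bottleneck adapter whose weights depend on the caption features.

    hidden_states: (tokens, HIDDEN) activations from a frozen encoder layer.
    f_sr, f_sa:    (FEAT,) semantic-related / semantic-agnostic features.
    """
    cond = np.concatenate([f_sr, f_sa])
    # Generate the adapter weights from the conditioning vector.
    w_down = (cond @ gen["to_down"]).reshape(HIDDEN, BOTTLENECK)
    w_up = (cond @ gen["to_up"]).reshape(BOTTLENECK, HIDDEN)
    bottleneck = np.maximum(hidden_states @ w_down, 0.0)  # ReLU
    return hidden_states + bottleneck @ w_up  # residual connection


rng = np.random.default_rng(0)
gen = init_generator(rng)
tokens = rng.normal(size=(5, HIDDEN))  # stand-in for one caption's token states

# Two captions with different (random, placeholder) disentangled features.
out_a = dynamic_adapter(tokens, rng.normal(size=FEAT), rng.normal(size=FEAT), gen)
out_b = dynamic_adapter(tokens, rng.normal(size=FEAT), rng.normal(size=FEAT), gen)

# With an all-zero conditioning vector the generated weights vanish,
# so the adapter reduces to the identity on the frozen activations.
out_zero = dynamic_adapter(tokens, np.zeros(FEAT), np.zeros(FEAT), gen)
```

In this sketch the same token states pass through different adapter weights depending on the conditioning features, whereas a static adapter would apply one fixed pair of matrices to every caption. Only the generator's parameters would be trained; the encoder producing `hidden_states` stays frozen.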

Paper Structure

This paper contains 44 sections, 12 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Illustration of the variety of textual expressions and the difference between the traditional adapter and our DASD: (a) Captions of the same image are differently expressed in Chinese-specific ways. (b) Traditional adapters whose parameters are fixed once learned. (c) Our method extracts semantic-related and semantic-agnostic features from captions and thereby produces dynamic adapters (DA).
  • Figure 2: Illustration of our proposed Dynamic Adapter with Semantics Disentangling (DASD). To make the dynamic adapters in the target-language branch $\Phi^T$ match its input $S^T$, semantics disentangling is performed to extract semantic-related and semantic-agnostic features ($f^{sr}$ and $f^{sa}$) from $S^T$, which are then used to generate input-conditional parameters (shown in the leftmost branch). The source-language branch $\Phi^S$ and visual branch $\Phi^V$ are provided by the frozen VLP model.
  • Figure 3: Visualization of the semantic-agnostic features extracted from 200 randomly selected Chinese sentences in the MSCOCO test set.
  • Figure 4: Performance of our model varies with the number of pretrained layers employed for semantic disentangling.
  • Figure 5: Visualization of the semantic-related features extracted from 200 randomly selected Chinese sentences in the MSCOCO test set.
  • ...and 2 more figures