Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Yabing Wang; Le Wang; Qiang Zhou; Zhibin Wang; Hao Li; Gang Hua; Wei Tang

TL;DR

This work tackles cross-lingual cross-modal retrieval (CCR) by addressing the semantic gap between visual content and non-English queries. It proposes LECCR, a two-stream CCR framework that uses a multimodal large language model (MLLM) to generate detailed visual descriptions. These descriptions are aggregated into multi-view semantic slots, which interact with the visual features to enrich their semantics. A multi-level matching scheme, paired with softened matching under English guidance, further aligns visual and non-English representations, improving CCR performance on four benchmarks (Multi30K, MSCOCO, VATEX, MSR-VTT-CN). Experiments show that LECCR surpasses strong two-stream baselines while remaining efficient, highlighting the practical value of integrating MLLMs for cross-lingual multimodal alignment.

Abstract

Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach uses machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations is challenging because of the significant semantic gap between vision and text, and because non-English representations are of lower quality owing to the pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates a multimodal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ the MLLM to generate detailed descriptions of the visual content and aggregate them into multi-view semantic slots that encapsulate different semantics. We then treat these semantic slots as internal features and let them interact with the visual features. Doing so enriches the semantic information within the visual features, narrowing the semantic gap between modalities and producing local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance, which provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr
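The abstract does not spell out how softened matching under English guidance is computed. One plausible reading, sketched below purely as an illustration, is that the visual-to-English similarity distribution serves as a soft target for the visual-to-non-English similarity distribution. The function names, the temperature value, and the use of KL divergence are all assumptions made for this sketch and are not taken from the paper or its released code.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def softened_matching_loss(sim_english, sim_target, temperature=0.1):
    """Hypothetical loss: mean KL between English-guided soft targets
    and the non-English matching distribution.

    sim_english[i][j]: similarity of visual item i to English caption j.
    sim_target[i][j]:  similarity of visual item i to the translated
                       (non-English) caption j.
    """
    losses = []
    for row_en, row_tgt in zip(sim_english, sim_target):
        teacher = softmax(row_en, temperature)   # soft targets (assumed English side)
        student = softmax(row_tgt, temperature)  # non-English predictions
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)


# Toy similarity matrices for two visual items and three captions.
sim_en = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
sim_tgt = [[0.3, 0.4, 0.2], [0.5, 0.4, 0.3]]
loss = softened_matching_loss(sim_en, sim_tgt)
```

Compared with one-hot matching on noisy machine-translated pairs, a soft target of this kind lets the non-English branch inherit graded similarities from the English branch, which is the intuition the abstract points to.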

Paper Structure (19 sections, 15 equations, 4 figures, 8 tables)

Introduction
Related Work
Cross-lingual Cross-modal Retrieval (CCR)
LLM-enhanced Vision-Language Models
Methods
Preliminary
Multi-view Semantic Slots Generation
Multi-view Visual-Semantic Interaction
Multi-level Matching
Softened Matching under English Guidance
Training and Inference
Experiment
Experimental Settings
Evaluation on Cross-lingual Image-Text Retrieval
Evaluation on Cross-lingual Video-Text Retrieval
...and 4 more sections

Figures (4)

Figure 3: Overview of the proposed LECCR framework. We utilize the multi-modal large language model (MLLM) to generate detailed visual descriptions, which are then employed as internal features to enhance the visual representations. Additionally, we introduce multi-level matching and softened matching under English guidance to improve the alignment between visual and non-English representations.
Figure 4: An example of a visual description generated by the MLLM.
Figure 5: Performance with different numbers (#view) of semantic slots.
Figure 6: Visualization of the multi-view semantic slots in the multi-view semantic interaction module (#view = 4). Each semantic slot distinctly focuses on local semantics within the images.
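
As a rough illustration of how a semantic slot can focus on local semantics, the sketch below aggregates a set of feature vectors into a small number of slots using scaled dot-product attention with one query per slot. This is a generic attention-pooling example, not the paper's module: the name `aggregate_slots`, the query vectors, and the toy features are hypothetical, and the actual LECCR design may differ.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_slots(queries, token_feats):
    """Pool token features into one slot per query vector.

    queries:     list of K query vectors (assumed learnable in practice).
    token_feats: list of N feature vectors, e.g. description tokens.
    Returns (slots, attention_maps), each with K entries.
    """
    dim = len(token_feats[0])
    scale = math.sqrt(dim)
    slots, attention_maps = [], []
    for query in queries:
        # Attention of this slot over all tokens.
        weights = softmax([dot(query, tok) / scale for tok in token_feats])
        # Slot = attention-weighted sum of token features.
        slot = [
            sum(w * tok[d] for w, tok in zip(weights, token_feats))
            for d in range(dim)
        ]
        slots.append(slot)
        attention_maps.append(weights)
    return slots, attention_maps


# Two toy queries over three 2-D token features.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]
slots, attention_maps = aggregate_slots(queries, tokens)
```

Because each query produces its own attention map, different slots can emphasize different tokens, which is the behavior the Figure 6 caption describes for #view = 4.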
