AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Ziyu Gong, Chengcheng Mai, Yihua Huang

TL;DR

This work tackles the challenge of information asymmetry in image-text retrieval by introducing asymmetry-sensitive contrastive learning (AsCL). It generates targeted positives and negatives for three fine-grained asymmetry types and couples this with a hierarchical cross-modal fusion that combines region-word local interactions with global image-text alignment, formalized through $S(I,T)=u_1 S_{local}(I,T)+(1-u_1) S_{global}(I,T)$ and optimized by $L_{AsCL}(I,T)=\tfrac{1}{2}L_I(I,T)+\tfrac{1}{2}L_{II\&III}(I,T^+)$. The method demonstrates state-of-the-art results on MSCOCO and Flickr30K, with ablations confirming the importance of generated samples and cross-modal fusion for improved alignment and uniformity in the embedding space. By enhancing sensitivity to fine-grained cross-modal differences and improving robustness to short text queries, AsCL advances practical image-text retrieval in real-world multimodal systems.
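The two formulas above can be illustrated with a minimal sketch. It only reproduces the weighted combinations as written in the TL;DR; the function names are hypothetical, the component scores and losses are taken as already-computed numbers (the summary does not define how $S_{local}$, $S_{global}$, $L_I$, or $L_{II\&III}$ are computed), and restricting $u_1$ to $[0,1]$ is an assumption made here so that the similarity is a convex combination.

```python
def fused_similarity(s_local: float, s_global: float, u1: float) -> float:
    """S(I,T) = u1 * S_local(I,T) + (1 - u1) * S_global(I,T).

    s_local and s_global are assumed to be precomputed similarity scores.
    The range check on u1 is an assumption, not stated in the summary.
    """
    if not 0.0 <= u1 <= 1.0:
        raise ValueError("u1 is assumed to lie in [0, 1]")
    return u1 * s_local + (1.0 - u1) * s_global


def ascl_loss(loss_i: float, loss_ii_iii: float) -> float:
    """L_AsCL(I,T) = 1/2 * L_I(I,T) + 1/2 * L_{II&III}(I,T+).

    Both terms are passed in as plain numbers; their definitions are
    not given in the summary, so they are not implemented here.
    """
    return 0.5 * loss_i + 0.5 * loss_ii_iii


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    print(fused_similarity(s_local=0.8, s_global=0.4, u1=0.5))
    print(ascl_loss(loss_i=1.0, loss_ii_iii=3.0))
```

With $u_1 = 1$ the score reduces to the local region-word term alone, and with $u_1 = 0$ to the global image-text term alone, so $u_1$ acts as the balance between the two levels of the hierarchical fusion.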

Abstract

Image-text retrieval aims to retrieve relevant information given an image or text query. The main challenge is to unify multimodal representations and distinguish fine-grained differences across modalities, thereby finding similar content and filtering out irrelevant content. However, existing methods mainly focus on unified semantic representation and concept alignment across modalities, while fine-grained differences across modalities have rarely been studied, making it difficult to solve the information asymmetry problem. In this paper, we propose a novel asymmetry-sensitive contrastive learning method. By generating corresponding positive and negative samples for different asymmetry types, our method simultaneously ensures fine-grained semantic differentiation and unified semantic representation across modalities. Additionally, we propose a hierarchical cross-modal fusion method that integrates global and local-level features through a multimodal attention mechanism to achieve concept alignment. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness and superiority of the proposed method.

Paper Structure

This paper contains 14 sections, 7 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Fine-grained information asymmetry types.
  • Figure 2: Overview of the proposed AsCL method.
  • Figure 3: Performance comparison of diverse samples based on fine-grained generation strategies for each asymmetry type. (a) Influence of different generated positives according to Asymmetry-II and Asymmetry-III. (b) Influence of generated negatives with different noise addition strategies according to Asymmetry-I. $\uparrow$ indicates that higher is better.
  • Figure 4: Mean distance between positive pairs (a) and negative pairs (b) in high-dimensional space on MSCOCO. $\uparrow$ indicates that higher is better; $\downarrow$ indicates that lower is better.
  • Figure 5: Image retrieval for text queries of different lengths.