MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching

Wanqing Cui, Rui Cheng, Jiafeng Guo, Xueqi Cheng

TL;DR

The paper addresses suboptimal fine-grained cross-modal matching in two-stream image-text models, which compress each image or text into a single vector representation. It proposes MVAM, which uses $m$ attention heads with learnable view codes to produce multi-view representations for images and text; these representations are concatenated for matching. A diversity loss, formulated via $A_vA_v^T - I$ and its variants, encourages the heads to attend to different content. Experiments on MSCOCO and Flickr30K show consistent retrieval gains over baselines, with interpretable, view-specific attention patterns. MVAM is presented as a plug-in for existing two-stream models that enables more robust cross-modal alignment.

Abstract

Existing two-stream models, such as CLIP, encode images and text through independent representations. Because they perform well while preserving retrieval speed, they have attracted attention from industry and academia. However, a single representation often struggles to capture complex content fully. Such models may ignore fine-grained information during matching, resulting in suboptimal retrieval results. To overcome this limitation and enhance the performance of two-stream models, we propose a Multi-view Attention Method (MVAM) for image-text matching. This approach leverages diverse attention heads with unique view codes to learn multiple representations for images and text, which are then concatenated for matching. We also incorporate a diversity objective to explicitly encourage attention heads to focus on distinct aspects of the input data, capturing complementary fine-grained details. This diversity enables the model to represent image-text pairs from multiple perspectives and to focus on more critical details, leading to a more comprehensive alignment of critical content and better matching performance. Our experiments on MSCOCO and Flickr30K demonstrate improvements over existing models, and further case studies reveal that different attention heads can focus on distinct content, achieving more comprehensive representations.
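
The mechanism described above (view codes that each pool the encoder's token features into one representation, concatenation of the views, and a penalty that pushes attention heads apart) can be illustrated with a small dependency-free sketch. This is not the authors' implementation. The scaled dot-product scoring, the squared Frobenius norm of $A A^T - I$ as the penalty, the function names, and the toy sizes are all assumptions made for illustration; the summary itself states only that the loss is built from $A_vA_v^T - I$ and its variants.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_view_attention(tokens, view_codes):
    """Pool n token vectors into m view vectors, one per view code.

    tokens:     n lists of length d (patch or word features from an encoder)
    view_codes: m lists of length d (learnable in a real model, fixed here)
    Returns (attention, concatenated): an m x n attention matrix and a
    single vector of length m * d holding the concatenated views.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)  # assumption: scaled dot-product scoring
    attention, concatenated = [], []
    for code in view_codes:
        scores = [sum(c * t for c, t in zip(code, tok)) / scale for tok in tokens]
        weights = softmax(scores)
        attention.append(weights)
        # The weighted sum of token vectors is this view's representation.
        view = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
        concatenated.extend(view)
    return attention, concatenated


def diversity_penalty(attention):
    """Squared Frobenius norm of A A^T - I for an m x n attention matrix A.

    Assumed form of the diversity objective; it is zero only when the
    heads' attention rows are orthonormal, so overlapping heads are penalized.
    """
    m = len(attention)
    total = 0.0
    for i in range(m):
        for k in range(m):
            gram = sum(a * b for a, b in zip(attention[i], attention[k]))
            total += (gram - (1.0 if i == k else 0.0)) ** 2
    return total


random.seed(0)
n_tokens, dim, n_views = 6, 4, 3  # hypothetical toy sizes
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
view_codes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_views)]
attention, rep = multi_view_attention(tokens, view_codes)
print(len(rep), round(diversity_penalty(attention), 4))
```

Under these assumptions, two heads that attend to exactly the same token incur a larger penalty than two heads that attend to different tokens, which is the behavior a diversity objective is meant to reward. In a two-stream model the same pooling would be applied to the image and text encoders separately, and the concatenated vectors compared with the usual similarity score.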

Paper Structure (25 sections, 13 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Examples of images retrieved by CLIP and by our model, MVAM-CLIP. The images retrieved by CLIP ignore some important information in the texts, while those retrieved by MVAM-CLIP are more consistent with the texts.
  • Figure 2: The network architecture of MVAM with ViT encoder.
  • Figure 3: Visualization of attention from different views. The left side shows the attention scores over the text for all 16 attention heads of MVAM; a darker grid cell indicates a larger attention value. The right side shows the image regions that four of the view attention heads attend to most.
  • Figure 4: The top-5 images retrieved by CLIP and MVAM-CLIP. Images are laid out in order of retrieval score, and the ground-truth images are marked with green boxes. For long and complex text queries, the images retrieved by MVAM-CLIP match the text better.