Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching

Xiang Ma, Xuemei Li, Lexin Fang, Caiming Zhang

TL;DR

DIAS is a novel method that bridges the modality gap from two aspects: it aligns the information that embeddings from different modalities represent in corresponding dimensions, and it introduces spatial constraints on inter- and intra-modality unmatched pairs to ensure that the model's semantic alignment is effective.

Abstract

Many contrastive-learning-based models have achieved advanced performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap between them. Letting such embeddings interact directly lacks rationality and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure that the correlation calculation is based on interactions of similar information. (2) We introduce spatial constraints on inter- and intra-modality unmatched pairs to ensure the effectiveness of the model's semantic alignment. In addition, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlations. Extensive experiments demonstrate the superiority of DIAS, achieving 4.3%-10.2% rSum improvements on the Flickr30k and MSCOCO benchmarks.
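
The rSum metric reported above is conventionally the sum of Recall@1, Recall@5 and Recall@10 in both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes a simplified one-to-one setting in which query i matches candidate i, whereas Flickr30k and MSCOCO pair each image with several captions. The function names `recall_at_k` and `rsum` are illustrative.

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K in percent for a (n_queries, n_candidates) similarity matrix.

    Assumption: the ground-truth match of query i is candidate i.
    """
    true_scores = np.diagonal(sim)[:, None]
    # Rank of the true candidate = number of candidates scoring strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return 100.0 * np.mean(ranks < k)

def rsum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions."""
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim.T, k) for k in ks)
    return image_to_text + text_to_image

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(100, 32))
    txt = img + 0.5 * rng.normal(size=(100, 32))  # noisy matched pairs
    print("rSum:", rsum(img @ txt.T))
```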

Paper Structure

This paper contains 14 sections, 17 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of distance consistency.
  • Figure 2: Overview of DIAS, which mainly contains two steps: local embedding interaction and global embedding interaction. First, DIAS extracts features from image regions and text words to construct local embeddings, and performs dimension information alignment to adjust the information representation of the embeddings in different dimensions ($\mathcal{L}_{dim}$). Then, DIAS aggregates the local embeddings to construct global embeddings. Inter- and intra-modality spatial constraints are obtained from the distance relationships between global embeddings to suppress the influence of the modality gap, and the sparse correlation algorithm is used to select the strongly correlated spatial relationships ($\mathcal{L}_{inter}$ and $\mathcal{L}_{intra}$). Finally, the image-text relevance is inferred via a contrastive learning loss function ($\mathcal{L}_{loc}$).
  • Figure 3: Illustration of dimension information alignment. We extract the dimension vector of each dimension and construct the correlation matrix by calculating the correlation between dimension vectors from different modalities. The proposed regularizer is applied to the correlation matrix to align the information representation of each dimension.
  • Figure 4: Histogram statistics of the spatial distance between instances within and across modalities. We randomly select some images and texts, calculate their distances, and observe the distribution pattern. The inter- and intra-modality distance distributions both approach a normal distribution. The embeddings used for this computation are from the state-of-the-art method zhang2024identification.
  • Figure 5: Illustration of the sparse correlation algorithm. We obtain the spatial matrix $\textbf{L}_{x}$, and the model learns a soft threshold based on the conditional probability to select strong correlations for each instance.
  • ...and 2 more figures
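
The dimension information alignment described in the Figure 3 caption can be illustrated with a short sketch: build a cross-modal correlation matrix between embedding dimensions, then penalize its deviation from a target. This is an illustration under stated assumptions, not the authors' $\mathcal{L}_{dim}$. It assumes a batch of matched image and text embeddings of equal width, uses Pearson correlation over the batch, and swaps in a Barlow Twins style penalty (diagonal toward 1, off-diagonal toward 0) because the caption does not specify the regularizer. The names `dimension_correlation`, `alignment_regularizer` and `off_diag_weight` are hypothetical.

```python
import numpy as np

def dimension_correlation(img_emb, txt_emb, eps=1e-8):
    """Correlation between image dimension i and text dimension j.

    img_emb, txt_emb: (n_pairs, d) arrays of matched embeddings (assumption).
    Returns a (d, d) matrix of Pearson correlations computed over the batch.
    """
    img = (img_emb - img_emb.mean(axis=0)) / (img_emb.std(axis=0) + eps)
    txt = (txt_emb - txt_emb.mean(axis=0)) / (txt_emb.std(axis=0) + eps)
    return img.T @ txt / img_emb.shape[0]

def alignment_regularizer(corr, off_diag_weight=0.005):
    """Hypothetical stand-in for the paper's regularizer.

    Pushes corresponding dimensions to correlate (diagonal -> 1) and
    non-corresponding dimensions to decorrelate (off-diagonal -> 0).
    """
    diag = np.diagonal(corr)
    on_diag = np.sum((1.0 - diag) ** 2)
    off_diag = np.sum(corr ** 2) - np.sum(diag ** 2)
    return on_diag + off_diag_weight * off_diag

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(256, 8))
    aligned_txt = img + 0.1 * rng.normal(size=(256, 8))
    shuffled_txt = np.roll(aligned_txt, 1, axis=1)  # same content, wrong dimensions
    print("aligned: ", alignment_regularizer(dimension_correlation(img, aligned_txt)))
    print("shuffled:", alignment_regularizer(dimension_correlation(img, shuffled_txt)))
```

In this toy setup the shuffled case carries the same information as the aligned one but in mismatched dimensions, so the penalty is larger; this is the kind of dimension-level mismatch the alignment step is meant to reduce.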