Learning Image-Text Matching with Optimal Partial Transport

Zhengxin Pan, Haishuai Wang, Fangyu Wu, Bailing Zhang, Jiajun Bu, Hongyang Chen

Abstract

Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods aimed at quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network as an effective solution for improving performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT uses the Cross-modal Mover's Distance to compute the similarity between fine-grained visual and textual fragments, with Sinkhorn iterations providing an efficient approximation. To further alleviate the issue of redundant alignments, we integrate partial matching into OMIT, leveraging local-to-global similarities to eliminate the interference of irrelevant fragments. We conduct extensive evaluations of OMIT on two benchmark image-text retrieval datasets, namely Flickr30K and MS-COCO. The superior performance achieved by OMIT on both datasets demonstrates its effectiveness in cross-modal matching. Furthermore, through comprehensive visualization analysis, we elucidate OMIT's inherent tendency towards focal matching, thereby shedding light on its efficacy. Our code is publicly available at https://github.com/ppanzx/OMIT.
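
To make the Sinkhorn-based similarity mentioned in the abstract concrete, the following is a minimal sketch of an entropy-regularized optimal transport similarity between image fragments and text fragments. It is not the authors' implementation (see their repository for that). The function names, the cost `1 - cosine`, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration. It also performs balanced transport only, so it does not include the partial matching step the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Approximate the OT plan with marginals a (rows) and b (columns).

    eps is the entropic regularization strength (assumed value).
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_similarity(regions, words, eps=0.1, n_iters=200):
    """Return (similarity, plan) for two sets of fragment embeddings."""
    # Assumption: cost is one minus cosine similarity between fragments.
    cost = [[1.0 - cosine(r, w) for w in words] for r in regions]
    # Assumption: every fragment carries equal mass.
    a = [1.0 / len(regions)] * len(regions)
    b = [1.0 / len(words)] * len(words)
    plan = sinkhorn_plan(cost, a, b, eps, n_iters)
    distance = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(regions))
        for j in range(len(words))
    )
    return 1.0 - distance, plan


# Toy usage: two image regions and two words in a shared 2-D space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
similarity, plan = ot_similarity(regions, words)
print(similarity)
```

Each row of `plan` describes how the mass of one image fragment is spread over the words, so a plan concentrated on a few entries corresponds to a sharp fragment-level alignment. Smaller `eps` values give sharper plans but can require more iterations to converge.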

Paper Structure (18 sections, 25 equations, 9 figures, 6 tables, 1 algorithm)


Figures (9)

  • Figure 1: Comparison between different image-text matching methods. (a) VSE directly calculates the cosine similarity between two average vectors. (b) PEM substitutes the cosine similarity with the divergence of parametric distributions. (c) CAM measures the average cosine similarities between query fragments and their projections. (d) OMIT regards the Cross-modal Mover's Distance as the dissimilarity.
  • Figure 2: Overview of OMIT. We begin by projecting the caption and image to the common space using textual and visual encoders, resulting in cross-modal fragments. Next, we employ partial matching and calculate the Sinkhorn distance between fragments in the Partial Sinkhorn Matching module. Finally, we leverage the contrastive loss to group relevant fragments and distinguish irrelevant fragments.
  • Figure 3: Upper: Performance curve on Flickr30K for different values of the hyper-parameter $\lambda$; Lower: Visualized comparison between column-wise transport plans under different $\lambda$.
  • Figure 4: Visual comparisons of alignments between our OMIT and CAM.
  • Figure 5: An illustrative example of image-text matching showcasing the column-wise transport plan generated by our OMIT variants. Left: region-based, Right: grid-based.
  • ...and 4 more figures