X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji

TL;DR

X-CLIP introduces an end-to-end multi-grained and cross-grained contrastive framework for video-text retrieval, integrating video-sentence, video-word, sentence-frame, and frame-word alignments. The Attention Over Similarity Matrix (AOSM) module dynamically weights and aggregates these multi-grained similarities to produce robust instance-level retrieval scores, addressing noise from irrelevant frames and words. Built on CLIP-based encoders with a temporal transformer for video modeling, X-CLIP achieves state-of-the-art results across five major video-text benchmarks, demonstrating the effectiveness of cross-grained contrast and AOSM. The approach offers a principled, scalable path for fine-grained semantic alignment in multi-modal retrieval with practical impact for large-scale video understanding tasks.

Abstract

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and is able to filter out unnecessary fine-grained features, guided by the coarse-grained feature, during similarity calculation, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices into an instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art with relative improvements of +6.3%, +6.6%, +11.1%, +6.7% and +3.8% on these benchmarks, respectively, demonstrating the superiority of multi-grained contrast and AOSM.
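The aggregation idea in the abstract can be illustrated with a small sketch. The abstract does not give the AOSM formula, so the code below assumes one plausible reading: a softmax over the similarity scores themselves supplies the attention weights, and the instance-level score is the weighted sum. The function names, the temperature `tau` and the toy similarity values are all hypothetical, chosen only for illustration.

```python
import math

def softmax(scores, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    peak = max(scores)
    exps = [math.exp((s - peak) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(scores, tau=0.1):
    """Softmax-weighted sum of a similarity vector.

    Assumption: this stands in for an AOSM-style aggregation; it is not
    taken from the paper's equations.
    """
    weights = softmax(scores, tau)
    return sum(w * s for w, s in zip(weights, scores))

def mean_pool(scores):
    """Plain average, the baseline that treats every word equally."""
    return sum(scores) / len(scores)

# Hypothetical video-word similarities: two relevant words, three irrelevant ones.
video_word = [0.82, 0.75, 0.10, 0.05, 0.12]

attended = attention_pool(video_word, tau=0.1)
averaged = mean_pool(video_word)
print(f"attention-pooled: {attended:.3f}, mean-pooled: {averaged:.3f}")
```

With a small temperature the low-similarity words receive almost no weight, so the pooled score stays close to the scores of the relevant words, whereas mean pooling is pulled down by the irrelevant ones. As `tau` grows, the attention weights become uniform and the result approaches the plain mean.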

Paper Structure

This paper contains 32 sections, 17 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: X-CLIP aims to improve video-text retrieval performance via multi-grained contrastive learning, including fine-grained (frame-word), coarse-grained (video-sentence) and cross-grained (video-word, sentence-frame) contrast. The transparency of words and frames represents their degree of relevance to the query.
  • Figure 2: Illustration of the proposed X-CLIP model. The input sentences are processed by the text encoder to generate coarse-grained and fine-grained textual representations. The input video is sampled into ordinal frames, and these frames are fed into the frame encoder to generate frame-level representations. The frame-level representations are then fed into the temporal encoder to capture temporal relationships. The outputs of the temporal encoder are the fine-grained visual representations, and the coarse-grained visual representation is obtained by averaging all of these fine-grained features. Based on these representations, we calculate the video-sentence, video-word, sentence-frame, and frame-word similarity scores.
  • Figure 3: Top-3 video-to-text retrieval results on MSR-VTT. The number in parentheses is the similarity score.
  • Figure 4: Top-3 text-to-video retrieval results on MSR-VTT. The number in parentheses is the similarity score.
  • Figure 5: Retrieval performance of models with different contrastive modules at different training-set sizes on the MSR-VTT dataset.
  • ...and 1 more figure
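The four similarity scores named in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the encoders have already produced frame, word and sentence feature vectors of equal dimension, and it uses a plain dot product as the similarity (any normalization is omitted). Only the averaging of frame features into the coarse-grained video feature comes from the caption; the function names and the toy features are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def multi_grained_similarities(frames, words, sentence):
    """Return the four similarity scores for one video-text pair.

    frames:   n fine-grained visual features (one per frame)
    words:    m fine-grained textual features (one per word)
    sentence: the coarse-grained textual feature
    """
    # Coarse-grained video feature: average of the frame features (per Figure 2).
    video = mean_vector(frames)
    return {
        "video_sentence": dot(video, sentence),                      # scalar
        "video_word": [dot(video, w) for w in words],                # length m
        "sentence_frame": [dot(f, sentence) for f in frames],        # length n
        "frame_word": [[dot(f, w) for w in words] for f in frames],  # n x m
    }

# Toy features (hypothetical): 3 frames, 2 words, dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
sentence = [1.0, 0.0]

sims = multi_grained_similarities(frames, words, sentence)
for name, value in sims.items():
    print(name, value)
```

Only the video-sentence score is already a single number. The cross-grained scores are vectors and the frame-word score is a matrix, so each still has to be reduced to an instance-level similarity before the pair can be ranked.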