TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval

Xiaolun Jing, Genke Yang, Jian Chu

TL;DR

TC-MGC tackles the misalignment between broad video semantics and textual descriptions by introducing a text-conditioned, multi-grained contrastive framework that refines video representations via language-guided attention. It combines a Similarity Reorganization (SR) module, a Similarity Decorrelation Regularization (SDR) loss, and a Linear Softmax Aggregation (LSA) module to integrate fine-grained (word-frame, frame-word) and coarse-grained (sentence-video) signals. Empirically, TC-MGC achieves results competitive with the state of the art on MSR-VTT, DiDeMo, and VATEX, with consistent gains from the text-conditioned cross-modal interactions, and ablations confirm the value of each component. The approach highlights the importance of conditioning visual representations on textual context and of balancing multi-grained similarities for robust text–video retrieval, at the cost of extra computation from the language–video attention.

Abstract

Motivated by the success of coarse-grained and fine-grained contrast in text-video retrieval, multi-grained contrastive learning methods have emerged that focus on integrating contrasts of different granularity. However, because videos cover a wider semantic range than texts, text-agnostic video representations might encode misleading information not described in the texts, which impedes the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language-video attention block to generate aggregated frame and video representations conditioned on the word's and text's attention weights over frames. To filter unnecessary similarity interactions and decrease trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module to identify attentive similarities and reorganize cross-modal similarity vectors and matrices. Next, we argue that the imbalance among multi-grained similarities may result in over- and under-representation issues. We therefore introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss, which facilitates cooperative relationship utilization by minimizing the similarity variance on matching text-video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage interactions between multiple similarities and promote the usage of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks, outperforming the X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), and +1.5% (+0.9%) relative (absolute) improvements in text-to-video retrieval R@1 on MSR-VTT, DiDeMo, and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
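
As a rough illustration of the language-video attention idea described in the abstract, the following NumPy sketch lets the sentence and word embeddings attend over frame embeddings to produce text-conditioned visual representations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the single-head formulation, the weight matrices `W_q`, `W_k`, `W_v`, `W_o`, `b_o`, the random inputs, and the exact placement of the residual connection are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_video_attention(text_emb, frame_emb, W_q, W_k, W_v, W_o, b_o):
    """Single-head cross-attention from text to frames (illustrative only).

    text_emb:  (1 + n_words, d), row 0 is the sentence embedding,
               the remaining rows are word embeddings (assumed layout).
    frame_emb: (n_frames, d) frame embeddings.
    """
    Q = text_emb @ W_q                    # queries from the text side
    K = frame_emb @ W_k                   # keys from the frames
    V = frame_emb @ W_v                   # values from the frames
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)    # each text token's weights over frames
    attended = weights @ V                # frames aggregated per text token
    # Assumed form of "fully connected layer and residual connection".
    out = attended + (attended @ W_o + b_o)
    return out, weights

rng = np.random.default_rng(0)
d, n_words, n_frames = 8, 5, 12
text_emb = rng.normal(size=(1 + n_words, d))
frame_emb = rng.normal(size=(n_frames, d))
W_q, W_k, W_v, W_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
b_o = np.zeros(d)

out, weights = language_video_attention(text_emb, frame_emb, W_q, W_k, W_v, W_o, b_o)
sentence_conditioned_video = out[0]    # shape (d,)
word_conditioned_frames = out[1:]      # shape (n_words, d)
```

In this sketch, each row of `weights` sums to one, so every text token receives a convex combination of the projected frame values before the fully connected layer is applied.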

Paper Structure

This paper contains 42 sections, 24 equations, 16 figures, 15 tables.

Figures (16)

  • Figure 1: Illustration of the multi-grained contrasts between frame and sentence (word) representations, including sentence-frame (cross-grained) and frame-word (fine-grained) contrasts. The connections indicate that the texts are semantic-relevant to sub-regions of videos.
  • Figure 2: The pipeline of TC-MGC. Given pair-wise text-video data, CLIP encoders simultaneously extract textual and visual representations, of which the extracted frame features are fed into the temporal encoder block for sequential modeling. Through the language-video attention block, video representations of different granularity are regenerated in a text-guided manner. Finally, multi-grained interaction is performed on the textual representations and the text-conditioned visual representations to obtain the similarity score.
  • Figure 3: The diagram of the language-video attention block. For the textual representations, query projection is employed to obtain $Q_{t}$, including the query-projected coarse-grained sentence embedding and fine-grained word embeddings. We similarly use key and value projections to obtain $K_{v}$ and $V_{v}$ from the frame representations. After computing relevance weights between textual and frame embeddings via scaled dot product, we aggregate the frame embeddings with the computed attention scores to obtain semantically relevant video representations $\hat{z}_{v|t}$, which are passed through a fully connected layer and a residual connection to obtain the sentence-conditioned video representation and word-conditioned frame representations.
  • Figure 4: Left: illustration of the multi-grained interaction mechanism. We first use matrix multiplication to obtain the video-sentence similarity score, the video-word and sentence-frame similarity vectors, and the frame-word similarity matrices, followed by the SR and Bi-SR modules, which reorganize the similarity vectors and matrices. Next, we apply the ISA and Bi-ISA modules to the reorganized similarity vectors and matrices to generate instance-level scores. Finally, we employ the LSA module to aggregate the multi-grained scores. Right: overview of LSA, which uses a cascade of linear and softmax layers to calculate the weights of the different instance-level scores.
  • Figure 5: Similarity Reorganization modules (SR). (a) We identify and rearrange the attentive similarities as the reorganized video-word vector. (b) We preserve the attentive similarities and fuse the inattentive similarities into one similarity, which are concatenated to generate the reorganized sentence-frame vector. (c) We extend SR module to bidirectional SR (Bi-SR) to obtain the reorganized frame-word matrix.
  • ...and 11 more figures
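
The Figure 4 caption describes LSA as a cascade of linear and softmax layers that weights the instance-level scores. A minimal NumPy sketch of that idea follows, assuming four instance-level scores per text-video pair and a single linear layer with hypothetical parameters `W` and `b`; the function name, layer size, and random inputs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_softmax_aggregation(scores, W, b):
    """Aggregate multi-grained scores with linear + softmax weights.

    scores: (n_texts, n_videos, n_grains) instance-level scores, e.g.
            video-sentence, video-word, sentence-frame, frame-word.
    W, b:   parameters of one linear layer (hypothetical shapes
            (n_grains, n_grains) and (n_grains,)).
    """
    logits = scores @ W + b                # linear layer over the score vector
    weights = softmax(logits, axis=-1)     # one weight per granularity
    aggregated = (weights * scores).sum(axis=-1)
    return aggregated, weights

rng = np.random.default_rng(0)
n_texts, n_videos, n_grains = 3, 3, 4
scores = rng.uniform(-1.0, 1.0, size=(n_texts, n_videos, n_grains))
W = rng.normal(scale=0.5, size=(n_grains, n_grains))
b = np.zeros(n_grains)

aggregated, weights = linear_softmax_aggregation(scores, W, b)
# With all-zero parameters the weights are uniform, so the result is a plain mean.
uniform_agg, _ = linear_softmax_aggregation(
    scores, np.zeros((n_grains, n_grains)), np.zeros(n_grains)
)
```

Because the weights are a softmax output, the aggregated score in this sketch is a convex combination of the per-granularity scores, so it always lies between the smallest and largest of them.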