TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Xiaolun Jing, Genke Yang, Jian Chu
TL;DR
TC-MGC tackles the misalignment between broad video semantics and textual descriptions by introducing a text-conditioned, multi-grained contrastive framework that refines video representations via language-guided attention. It combines a Similarity Reorganization (SR) module, a Similarity Decorrelation Regularization (SDR) loss, and a Linear Softmax Aggregation (LSA) module to integrate fine-grained (word-frame, frame-word) and coarse-grained (sentence-video) signals. Empirically, TC-MGC achieves results competitive with the state of the art on MSR-VTT, DiDeMo, and VATEX, with consistent gains from the text-conditioned cross-modal interactions and ablations confirming the value of each component. The approach highlights the importance of conditioning visual representations on textual context and of balancing multi-grained similarities for robust text-video retrieval, albeit at a higher computational cost due to the language-video attention.
Abstract
Motivated by the success of coarse-grained and fine-grained contrast in text-video retrieval, multi-grained contrastive learning methods have emerged that integrate contrasts of different granularity. However, because videos cover a wider semantic range than their captions, text-agnostic video representations may encode misleading information not described in the text, impeding the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language-video attention block to generate aggregated frame and video representations conditioned on each word's and the text's attention weights over frames. To filter unnecessary similarity interactions and decrease trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module to identify attentive similarities and reorganize cross-modal similarity vectors and matrices. Next, we argue that imbalance among multi-grained similarities may result in over- and under-representation issues. We therefore introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss that encourages the similarities to be used cooperatively by minimizing similarity variance on matching text-video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage interactions between multiple similarities and promote the use of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks, outperforming the X-CLIP model with relative (absolute) improvements of +2.8% (+1.3%), +2.2% (+1.0%), and +1.5% (+0.9%) in text-to-video retrieval R@1 on MSR-VTT, DiDeMo, and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
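To make two of the ideas above concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes that text-conditioned aggregation can be approximated by softmax attention of a text query over frame embeddings, and that the variance-minimization idea behind SDR can be approximated by the variance of several similarity scores for each matching pair. The function names, the `temperature` parameter, and the array shapes are hypothetical; the actual language-video attention block and SDR loss may use learned projections and a different formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def text_conditioned_pool(query, frames, temperature=1.0):
    """Pool frame features into one vector, weighted by a text query.

    query:  (D,) sentence or word embedding (hypothetical input).
    frames: (T, D) per-frame embeddings (hypothetical input).
    Returns the pooled (D,) vector and the (T,) attention weights.
    Assumption: plain dot-product attention with no learned projections.
    """
    scores = frames @ query / temperature  # (T,) query-frame similarity
    weights = softmax(scores)              # attention over frames, sums to 1
    pooled = weights @ frames              # (D,) weighted average of frames
    return pooled, weights


def similarity_variance(similarities):
    """Mean per-pair variance across similarity granularities.

    similarities: (N, K) array; row i holds K similarity scores of
    different granularity for the i-th matching text-video pair.
    Assumption: a simple stand-in for a variance-minimization penalty.
    """
    return float(np.mean(np.var(similarities, axis=1)))


# Toy usage with random features (illustrative only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 8))    # 12 frames, 8-dim features
sentence = rng.normal(size=8)        # one sentence embedding
video_vec, attn = text_conditioned_pool(sentence, frames)
penalty = similarity_variance(rng.uniform(size=(4, 3)))
```

In this sketch, a zero query yields uniform attention, so the pooled vector reduces to mean pooling over frames, which is the text-agnostic baseline the abstract argues against. The penalty is zero only when all similarity scores of a matching pair agree.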
