ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun
TL;DR
ProTA addresses content asymmetry in text-video retrieval by treating each token as a probability distribution and performing cross-modal aggregation in both low- and high-dimensional spaces. Dual Partial-related Aggregation (DPA) disentangles partially related content using a low-dimensional attention mechanism and a high-dimensional Gram matrix built with multiple Gaussian kernels, while Token-based Probabilistic Alignment (TPA) models tokens as Gaussian distributions and uses the 2-Wasserstein distance to compute token-level similarity. KL regularization maintains distributional diversity, and a contrastive loss with an adaptive margin (L^{contra}) further tightens positive pairs, yielding the overall objective L^{all} = L^{contra} + \beta L^{kl}. Across MSR-VTT, LSMDC, and DiDeMo, ProTA achieves significant improvements over state-of-the-art methods, indicating robust handling of intra- and inter-pair uncertainty and improved generalization to diverse cross-modal content.
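As an illustration of the token-level quantities named above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each token is a Gaussian with diagonal covariance, in which case the squared 2-Wasserstein distance has the closed form ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2. It also assumes the KL term is taken against a standard normal prior, which this summary does not state. All function names are hypothetical.

```python
import numpy as np


def w2_squared_diag(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    Assumption: covariances are diagonal, so sigma_* hold per-dimension
    standard deviations. The last axis is the feature dimension; leading
    axes broadcast.
    """
    mean_term = np.sum((mu_a - mu_b) ** 2, axis=-1)
    std_term = np.sum((sigma_a - sigma_b) ** 2, axis=-1)
    return mean_term + std_term


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the feature axis.

    Assumption: the prior is a standard normal; the source only says that
    a KL regularizer is used.
    """
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var), axis=-1)


def token_distance_matrix(text_mu, text_sigma, video_mu, video_sigma):
    """Pairwise squared W2 between n text tokens and m video tokens.

    Inputs have shapes (n, d) and (m, d); the result has shape (n, m).
    """
    return w2_squared_diag(
        text_mu[:, None, :], text_sigma[:, None, :],
        video_mu[None, :, :], video_sigma[None, :, :],
    )
```

A smaller distance corresponds to a higher token-level similarity. How these pairwise distances are aggregated into a single text-video score, and how the adaptive margin is computed, are not specified in this summary and are left out of the sketch.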
Abstract
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the full spatial-temporal relations. However, since video clips contain more diverse content than captions, a model that aligns these asymmetric video-text pairs runs a high risk of retrieving many false positives. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimensional and high-dimensional spaces. We also propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain the diversity of the feature representation. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
