ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun

TL;DR

ProTA addresses content asymmetry in text-video retrieval by modeling each token as a probability distribution and aggregating cross-modal tokens in both low- and high-dimensional spaces. Dual Partial-related Aggregation (DPA) disentangles partially related content with a low-dimensional attention mechanism and a high-dimensional Gram matrix built from multiple Gaussian kernels. Token-based Probabilistic Alignment (TPA) represents tokens as Gaussian distributions and measures token-level similarity with the 2-Wasserstein distance. A KL regularizer maintains distributional diversity, and an adaptive contrastive loss $L^{contra}$ with an adaptive margin tightens positive pairs, giving the overall objective $L^{all} = L^{contra} + \beta L^{kl}$. On MSR-VTT, LSMDC, and DiDeMo, ProTA reports significant improvements (50.9%, 25.8%, and 47.2%, respectively) by handling intra-pair and inter-pair uncertainty in a unified manner.
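The quantities named in this summary have simple closed forms when each token is a Gaussian with diagonal covariance. The following is a minimal sketch under that assumption, not the authors' implementation: the function names are hypothetical, the diagonal-covariance restriction is an assumption, and the choice of a standard normal prior for the KL term is also an assumption (the summary does not say which reference distribution is used).

```python
import math


def w2_squared(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mu_a, diag(sigma_a^2)) and N(mu_b, diag(sigma_b^2)) the closed
    form is ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ).

    Assumption: the regularizer pulls token distributions toward a
    standard normal prior; the source does not specify the prior.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def total_loss(l_contra, l_kl, beta):
    """Combined objective L_all = L_contra + beta * L_kl, as stated in the summary."""
    return l_contra + beta * l_kl


if __name__ == "__main__":
    # Two hypothetical 2-D token distributions (illustrative values only).
    text_mu, text_sigma = [0.0, 0.0], [1.0, 1.0]
    video_mu, video_sigma = [3.0, 4.0], [1.0, 1.0]
    print(w2_squared(text_mu, text_sigma, video_mu, video_sigma))
    print(kl_to_standard_normal(text_mu, text_sigma))
```

A smaller distance would correspond to a higher token-level similarity; how the distance is turned into a similarity score and how scores are pooled over tokens is not described here and is left out of the sketch.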

Abstract

Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling whole spatial-temporal relations. However, because video clips contain more diverse content than captions, a model that aligns these asymmetric video-text pairs risks retrieving many false positives. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimensional and high-dimensional spaces. We also propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. In extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
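To make the high-dimensional side of the aggregation more concrete, the sketch below builds a Gram matrix between text tokens and video tokens by averaging several Gaussian (RBF) kernels, following the TL;DR's description of "a Gram matrix with multiple Gaussian kernels". It is illustrative only: the function name `multi_kernel_gram`, the bandwidth values, and the choice to average the kernels are assumptions, and the paper's actual formulation may differ.

```python
import math


def multi_kernel_gram(text_tokens, video_tokens, bandwidths=(0.5, 1.0, 2.0)):
    """Text-by-video Gram matrix averaged over several Gaussian kernels.

    Entry [i][j] is the mean over bandwidths s of
    exp(-||t_i - v_j||^2 / (2 * s^2)).
    The bandwidths and the averaging are hypothetical choices.
    """
    gram = []
    for t in text_tokens:
        row = []
        for v in video_tokens:
            sq_dist = sum((a - b) ** 2 for a, b in zip(t, v))
            kernel_sum = sum(
                math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths
            )
            row.append(kernel_sum / len(bandwidths))
        gram.append(row)
    return gram


if __name__ == "__main__":
    # One text token and two video tokens (illustrative values only).
    text = [[0.0, 0.0]]
    video = [[0.0, 0.0], [3.0, 4.0]]
    for row in multi_kernel_gram(text, video):
        print(row)
```

Each entry lies in (0, 1] and equals 1 only when the two token vectors coincide, so video tokens unrelated to a caption token contribute values near zero. This is one way partially related content could be down-weighted before re-aggregation.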

Paper Structure (17 sections, 16 equations, 7 figures, 11 tables)

This paper contains 17 sections, 16 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: (a) Intra-pair uncertainty. Videos often contain more information than their captions. (b) Inter-pair uncertainty. A query can easily match negative videos with similar semantics. (c) ProTA aligns cross-modality tokens at a fine-grained level and adaptively adjusts the token-level distribution, handling both kinds of uncertainty in a unified manner. (d) Intra/inter-pair examples from MSR-VTT [xu2016msr].
  • Figure 2: Overview of Probabilistic Token Aggregation. Two stream encoders estimate the probabilistic distributions. Dual partial-related aggregation performs token-level interaction to handle intra-pair uncertainty. Token-based probabilistic alignment minimizes representation uncertainty to tackle inter-pair uncertainty.
  • Figure 3: (a) Analysis of $L_{kl}$ when training on MSR-VTT-9k [xu2016msr]. (b) Adaptive margin versus performance on the MSR-VTT-9k 1kA test, where $m$ is the average of $m_{t2v}$ and $m_{v2t}$.
  • Figure 4: Visualization of probabilistic distribution.
  • Figure 5: Ablation of intra-modality aggregation. We adopt the same sampling strategy as paper and select 4 frames uniformly from frame sequences.
  • ...and 2 more figures