GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval

Yuting Wang, Jinpeng Wang, Bin Chen, Tao Dai, Ruisheng Luo, Shu-Tao Xia

TL;DR

GMMFormer v2 addresses partially relevant video retrieval by handling moment uncertainty and text-clip misalignment in untrimmed videos. It introduces a temporal consolidation module that adaptively fuses multi-scale contextual features, paired with uncertainty-aware cross-modal matching: a revised query diverse loss that encourages discrimination between hard text pairs, and an optimal matching loss that uses the Hungarian algorithm to diversify text-clip assignments. Together, these components suppress semantic collapse and promote fine-grained text-clip alignment, yielding state-of-the-art or near-state-of-the-art results across three PRVR benchmarks and demonstrating transferable benefits as plug-in supervision for other models. The approach improves the efficiency and effectiveness of moment localization, with practical implications for scalable retrieval over long, untrimmed video collections.

Abstract

Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments. Due to the lack of moment annotations, the uncertainty in clip modeling and text-clip correspondence poses major challenges. Despite great progress, existing solutions sacrifice either efficiency or efficacy to capture varying and uncertain video moments. Worse, few methods have paid attention to the text-clip matching pattern under such uncertainty, exposing the risk of semantic collapse. To address these issues, we present GMMFormer v2, an uncertainty-aware framework for PRVR. For clip modeling, we improve a strong baseline, GMMFormer, with a novel temporal consolidation module built upon multi-scale contextual features, which maintains efficiency and improves the perception of varying moments. To achieve uncertainty-aware text-clip matching, we upgrade the query diverse loss in GMMFormer to facilitate fine-grained uniformity and propose a novel optimal matching loss for fine-grained text-clip alignment. Their collaboration alleviates the semantic collapse phenomenon and neatly promotes accurate correspondence between texts and moments. We conduct extensive experiments and ablation studies on three PRVR benchmarks, demonstrating remarkable improvement of GMMFormer v2 compared to the past SOTA competitor and the versatility of uncertainty-aware text-clip matching for PRVR. Code is available at https://github.com/huangmozhi9527/GMMFormer_v2.
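
The assignment step behind the optimal matching loss can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy similarity matrix, and the choice to negate similarities into costs are assumptions made for illustration, and how the resulting assignment enters the loss is left to the paper. The sketch only shows how a one-to-one assignment found by the Hungarian algorithm spreads text queries over distinct clips, whereas an independent per-query argmax can send every query to the same clip.

```python
from typing import List


def hungarian_min_cost(cost: List[List[float]]) -> List[int]:
    """Minimum-cost one-to-one assignment of rows to columns.

    Classic O(n^2 * m) Hungarian algorithm with potentials.
    Requires len(cost) <= len(cost[0]). Returns the column index per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials (1-indexed)
    v = [0.0] * (m + 1)    # column potentials (1-indexed)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:     # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_texts_to_clips(sim: List[List[float]]) -> List[int]:
    """Maximise total similarity by minimising the negated similarities."""
    return hungarian_min_cost([[-s for s in row] for row in sim])


# Toy similarities (hypothetical): 3 text queries x 4 clips of one video.
sim = [
    [0.90, 0.80, 0.10, 0.20],
    [0.80, 0.30, 0.70, 0.10],
    [0.85, 0.20, 0.30, 0.60],
]
greedy = [max(range(len(row)), key=row.__getitem__) for row in sim]
assigned = match_texts_to_clips(sim)
print(greedy)    # [0, 0, 0]  every query picks clip 0
print(assigned)  # [1, 2, 0]  distinct clips, total similarity 2.35
```

In this toy case the per-query argmax collapses onto a single clip, while the optimal assignment gives each query its own clip at only a small cost in individual similarity.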

Paper Structure (50 sections, 15 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: (a) Explicit PRVR methods adopt multi-scale sliding windows to traverse all possible clips, which are redundant and inefficient. (b) Implicit methods improve efficiency by combining multi-scale information and generating fewer clip embeddings. However, the static aggregation is inflexible for capturing moments with unexpected moment-to-video ratios (M/Vs), e.g., the clip in the blue dotted box, beyond predefined clip masks. (c) We propose a temporal consolidation module to improve the clip modeling. By learning adaptive aggregation weights for different time points in a video, it is capable of perceiving video moments with varying lengths.
  • Figure 2: The overall architecture of GMMFormer v2.
  • Figure 3: The detailed architecture of TC-GMMBlock.
  • Figure 4: The semantic collapse problem and our solution. (a) With only basic retrieval training loss $\mathcal{L}^{basic}$, we find a semantic collapse phenomenon. (b) Query diverse loss $\mathcal{L}^{div}$ can preserve the text semantic structure by encouraging fine-grained uniformity. (c) Optimal matching loss $\mathcal{L}^{om}$ can assure non-redundant matching between text queries and relevant clips, neatly promoting fine-grained text-clip alignment. Red edges between text queries and clips reflect the optimal assignments.
  • Figure 5: (a) The impact of the hyper-parameter $\gamma$ in Eq. \ref{eq_gamma} and (b) the impact of the Gaussian variance $\sigma$ on ActivityNet Captions. $\sigma = \infty$ means the vanilla Transformer encoder layer.
  • ...and 4 more figures
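
Two ideas from the captions above can be made concrete with a rough sketch: the Gaussian window whose width is controlled by $\sigma$ (Figure 5), and the per-time-point aggregation weights over multi-scale features (Figure 1c). Everything here is an assumption for illustration rather than the paper's formulation: the window is taken as $\exp(-(i-j)^2 / (2\sigma^2))$, the aggregation weights are a softmax over per-time-point scores, and the names `gaussian_window` and `consolidate` are hypothetical. In the actual model the scores would be learned; here they are supplied by hand.

```python
import math
from typing import List


def gaussian_window(n: int, sigma: float) -> List[List[float]]:
    """n x n weights w[i][j] = exp(-(i - j)^2 / (2 * sigma^2)), sigma > 0.

    With sigma = inf every weight is 1, so multiplying attention scores by
    the window changes nothing, consistent with the caption's remark that
    sigma = inf corresponds to a vanilla Transformer encoder layer.
    """
    denom = 2.0 * sigma * sigma
    return [[math.exp(-((i - j) ** 2) / denom) for j in range(n)]
            for i in range(n)]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def consolidate(scale_feats: List[List[List[float]]],
                scores: List[List[float]]) -> List[List[float]]:
    """Fuse multi-scale features with a separate weight vector per time point.

    scale_feats[k][t] is the feature vector of scale k at time t.
    scores[t][k] is an unnormalised score for scale k at time t.
    """
    num_scales = len(scale_feats)
    num_steps = len(scale_feats[0])
    fused = []
    for t in range(num_steps):
        w = softmax(scores[t])
        dim = len(scale_feats[0][t])
        fused.append([sum(w[k] * scale_feats[k][t][d] for k in range(num_scales))
                      for d in range(dim)])
    return fused


# Two scales (narrow and wide window), two time points, 1-d features.
narrow = [[1.0], [1.0]]
wide = [[3.0], [3.0]]
scores = [[0.0, 0.0],      # t = 0: both scales weighted equally
          [20.0, -20.0]]   # t = 1: almost all weight on the narrow scale
fused = consolidate([narrow, wide], scores)
flat = gaussian_window(3, float("inf"))
local = gaussian_window(3, 1.0)
```

A static aggregation would use the same weights at every time point; letting the weights vary with `t` is what allows different parts of a video to favour different temporal scales.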