
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

TL;DR

RGNet addresses long video temporal grounding by unifying clip retrieval and moment grounding within a single transformer-based network. The RG-Encoder performs cross-modal retrieval at clip and frame granularity using sparse attention and a learnable retrieval token, while a grounding decoder predicts precise moment boundaries from retrieved clips. The model is trained with intra-clip attention loss, inter-clip contrastive loss, and a grounding loss, enabling end-to-end optimization and mutual enhancement of retrieval and grounding. Empirical results on MAD and Ego4D-NLQ show state-of-the-art performance, with substantial improvements in retrieval and grounding due to the integrated, end-to-end design and targeted losses. This approach advances practical LVTG by closely modeling long-video semantics and reducing the gap between retrieval and grounding in hour-long videos.
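The retrieve-then-ground flow summarized above can be pictured with a small sketch. It is not the authors' implementation: the fixed-length clip splitting, the `score_clip` and `ground_in_clip` callables, and the toy scoring rule are all hypothetical stand-ins for the RG-Encoder's retrieval score and the grounding decoder. The sketch only shows how clip-local predictions map back to timestamps in the full video.

```python
import math
from typing import Callable, List, Tuple

Clip = Tuple[float, float]  # (start_sec, end_sec) in video time


def split_into_clips(duration: float, clip_len: float) -> List[Clip]:
    """Cut a video of `duration` seconds into consecutive fixed-length clips."""
    n = math.ceil(duration / clip_len)
    return [(i * clip_len, min((i + 1) * clip_len, duration)) for i in range(n)]


def retrieve_then_ground(
    duration: float,
    clip_len: float,
    top_k: int,
    score_clip: Callable[[Clip], float],
    ground_in_clip: Callable[[Clip], Tuple[float, float]],
) -> List[Tuple[float, float, float]]:
    """Return (score, start_sec, end_sec) for the top_k clips, best first.

    Assumption: `ground_in_clip` returns offsets relative to the clip start.
    """
    clips = split_into_clips(duration, clip_len)
    ranked = sorted(clips, key=score_clip, reverse=True)[:top_k]
    results = []
    for clip in ranked:
        rel_start, rel_end = ground_in_clip(clip)
        # Shift clip-local offsets back to video time.
        results.append((score_clip(clip), clip[0] + rel_start, clip[0] + rel_end))
    return results


# Toy stand-ins: the "model" is replaced by overlap with a known target moment.
target = (3725.0, 3740.0)


def toy_score(clip: Clip) -> float:
    return max(0.0, min(clip[1], target[1]) - max(clip[0], target[0]))


def toy_ground(clip: Clip) -> Tuple[float, float]:
    start = max(clip[0], target[0])
    end = min(clip[1], target[1])
    return (start - clip[0], end - clip[0])


preds = retrieve_then_ground(7200.0, 180.0, 1, toy_score, toy_ground)
```

With a two-hour video and 180-second clips, the retrieval step has to rank 40 candidates before grounding runs at all, which is why an error in retrieval cannot be recovered by the grounding stage.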

Abstract

Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to mimic the long video paradigm closely during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
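The idea of contrasting the clip that contains the moment against other clips can be illustrated with a generic InfoNCE-style loss. This is a sketch under assumptions: the summary does not give the paper's exact loss or sampling rule, so the InfoNCE form, the `temperature` value, and the choice of negatives (here, scores of other clips that do not contain the moment) are illustrative only.

```python
import math
from typing import List


def info_nce(pos_score: float, neg_scores: List[float], temperature: float = 0.07) -> float:
    """Negative log-softmax of the positive clip among positive + negative clips.

    `pos_score` is the query-clip similarity for the clip containing the moment;
    `neg_scores` are similarities for sampled clips that do not contain it.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# A well-separated positive gives a small loss; a confusable one gives a large loss.
easy = info_nce(0.9, [0.1, 0.2, 0.0])
hard = info_nce(0.3, [0.4, 0.35, 0.3])
```

Sampling more negative clips per query makes the training task look more like inference on a long video, where the correct clip competes with many candidates.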

Paper Structure

This paper contains 21 sections, 13 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overview of RGNet. It predicts the moment boundary specified by a textual query from an hour-long video. First, our proposed RG-Encoder maps the video and text features to a joint space and retrieves the relevant clip feature. The subsequent grounding decoder processes the retrieved features to predict the start and end times of the moment. The encoder operates in parallel at multiple levels of granularity (e.g., clip and frame) to achieve an end-to-end solution.
  • Figure 2: Unified Solution. (left) Existing methods use separate retrieval and grounding networks. The disjoint retrieval lacks fine-grained event understanding, which is crucial for moment localization. (right) Our unified network architecture overcomes this limitation by deeply integrating the retrieval module with the grounding objective.
  • Figure 3: Overview of RG-Encoder. It takes video clips and a textual query as input and retrieves the relevant clip features. First, a cross-attention layer fuses the clips with the text, and the sparsifier masks the out-of-moment frames. Based on the mask, the retrieval attention focuses on in-moment frames (colored red) and generates clip-level context and frame-level content features. We combine the context and content to generate the retrieved clip feature.
  • Figure 4: Impact of the number of retrieved clips. Reducing the number of clips shortens the network's execution time. While the baseline model suffers a significant drop in performance with this reduction, RGNet shows a noticeably smaller decline under the same conditions.
  • Figure 5: Impact of retrieved clip length. Longer clips improve retrieval because there are fewer candidates. However, grounding becomes exceedingly difficult with longer clips. For example, grounding performance drops beyond a clip length of 180 seconds on MAD and 48 seconds on Ego4D.
  • ...and 6 more figures
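
The RG-Encoder flow described in the Figure 3 caption (mask out-of-moment frames, attend over the remaining frames, then combine clip-level context with frame-level content) can be sketched in plain Python. This is a toy illustration under assumptions, not the paper's architecture: the thresholding rule in `sparsify`, the use of per-frame relevance scores in place of a learnable retrieval token's attention, and the additive combination of context and content are all hypothetical choices.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sparsify(relevance: List[float], threshold: float = 0.5) -> List[bool]:
    """Keep frames whose query relevance reaches the threshold (hypothetical rule)."""
    return [r >= threshold for r in relevance]


def masked_softmax(scores: List[float], keep: List[bool]) -> List[float]:
    """Softmax over kept positions only; masked positions get weight 0."""
    kept = [s for s, k in zip(scores, keep) if k]
    if not kept:
        return [0.0] * len(scores)
    m = max(kept)
    exps = [math.exp(s - m) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_clip_feature(
    frames: List[Vector], relevance: List[float], threshold: float = 0.5
) -> Tuple[List[Vector], List[float]]:
    """Return (retrieved per-frame features, attention weights) for one clip."""
    keep = sparsify(relevance, threshold)
    weights = masked_softmax(relevance, keep)
    dim = len(frames[0])
    # Clip-level context: attention-weighted pooling over in-moment frames.
    context = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Frame-level content: in-moment frames are kept, the rest are zeroed.
    content = [[x if k else 0.0 for x in f] for f, k in zip(frames, keep)]
    # Assumed combination: add the clip context to every frame's content.
    retrieved = [[c + x for c, x in zip(context, f)] for f in content]
    return retrieved, weights


frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
relevance = [0.9, 0.1, 0.9]  # the middle frame is treated as out-of-moment
retrieved, weights = retrieve_clip_feature(frames, relevance)
```

In this toy run the masked middle frame receives zero attention weight, so the clip-level context is built only from the two in-moment frames.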