End-to-End Dense Video Grounding via Parallel Regression

Fengyuan Shi, Weilin Huang, Limin Wang

TL;DR

This work tackles dense video grounding by reframing the task as language-conditioned regression. It introduces PRVG, an end-to-end Transformer-inspired framework that uses paragraph-level language queries to directly regress temporal boundaries for each sentence in parallel, without proposal generation or post-processing. The model comprises a Contextualized Representation Encoder, a Language Modulated Decoder, and a Parallel Regression Head, trained with a regression loss and a robust proposal-level attention loss to guide attention on ground-truth regions. Experiments on ActivityNet Captions and TACoS demonstrate competitive or superior performance to state-of-the-art methods, with notable gains in efficiency and the ability to handle both sparse and dense grounding scenarios. The approach also offers interpretability through language-driven queries and provides insights into the benefits of parallel decoding for multimodal grounding tasks.

Abstract

Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query. Existing methods often address this task indirectly, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which can lead to ambiguous localization when the description is unclear. In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input. Viewing video grounding as language-conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG). The key design in our PRVG is to use language as queries, and to directly regress the moment boundaries based on language-modulated visual representations. Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes (sparse or dense grounding) and allows for efficient inference without any post-processing technique. In addition, we devise a robust proposal-level attention loss to guide the training of PRVG, which is invariant to moment duration and contributes to model convergence. We perform experiments on two video grounding benchmarks, ActivityNet Captions and TACoS, demonstrating that our PRVG can significantly outperform previous methods. We also perform in-depth studies to investigate the effectiveness of the parallel regression paradigm on video grounding.
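The abstract describes the proposal-level attention loss only as being invariant to moment duration; its exact form is not given in this excerpt. The sketch below is one plausible reading, stated as an assumption: the loss is the negative log of the total attention mass that a sentence places on the clips inside its ground-truth moment. A per-clip ("position-level") variant is included purely for contrast. The function names, the `eps` constant, and the toy attention vectors are all hypothetical.

```python
import math

def proposal_level_attention_loss(attn, start, end, eps=1e-8):
    """Assumed form: -log of the attention mass inside the ground-truth moment.

    attn: non-negative attention weights over T clips, summing to 1.
    start, end: inclusive clip indices of the ground-truth moment.
    """
    inside = sum(attn[start:end + 1])
    return -math.log(inside + eps)

def position_level_attention_loss(attn, start, end, eps=1e-8):
    """Contrast case (hypothetical): average -log of each clip's own weight."""
    clips = attn[start:end + 1]
    return sum(-math.log(a + eps) for a in clips) / len(clips)

# Two toy cases where all attention already lies inside the moment.
attn_short = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 2-clip moment
attn_long = [0.125] * 8 + [0.0, 0.0]                              # 8-clip moment

short_prop = proposal_level_attention_loss(attn_short, 2, 3)
long_prop = proposal_level_attention_loss(attn_long, 0, 7)
short_pos = position_level_attention_loss(attn_short, 2, 3)
long_pos = position_level_attention_loss(attn_long, 0, 7)
```

Under this assumed form, both toy moments get a proposal-level loss of essentially zero, because only the summed mass matters. The per-clip variant instead penalizes the longer moment more, since spreading attention over more clips lowers each individual weight.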

Paper Structure

This paper contains 18 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An illustrative example of the dense video grounding task. Dense video grounding aims to jointly localize multiple temporally ordered moments described by a paragraph in an untrimmed video.
  • Figure 2: (a) Proposal-based pipeline. (b) Proposal-free pipeline. (c) Our proposed PRVG pipeline. Both proposal-based and proposal-free methods address the video grounding task in an indirect "one-to-many" manner, i.e., they generate many proposals for one language description, either by pre-defining dense proposals or by predicting dense proposals at all locations, and therefore suffer from complicated label assignment and near-duplicate removal. PRVG directly regresses the temporal boundary for each language description via parallel regression, without a classification branch or extra post-processing such as ranking or NMS.
  • Figure 3: Pipeline of PRVG. PRVG streamlines dense video grounding with a direct, parallel decoding paradigm composed of two steps: feature encoding and parallel regression. In the feature encoding phase, video and sentence features are extracted by a 3D CNN and an LSTM, respectively. For parallel regression, a Contextualized Representation Encoder (CRE) augments the feature representations of both modalities with global context information, and a Language Modulated Decoder (LMD), which uses language as queries, is coupled with a Parallel Regression Head (PRH) to directly predict the temporal moment for each sentence description. PRVG thus captures intra-modal global structure for contextualized representations and models the cross-modal relation from a global view for flexible and accurate moment localization.
  • Figure 4: (a) Contextualized Representation Encoder (CRE). (b) Language Modulated Decoder (LMD). CRE is built on the self-attention mechanism and takes the video or sentence features as input. It models long-term dependencies in the video and extracts semantic relevance among the sentences in a paragraph, producing contextualized representations for both modalities. LMD is built on the cross-attention mechanism, using language as queries and video features as keys and values. It aggregates visual information under the guidance of the language queries for subsequent parallel regression.
  • Figure 5: Illustration of two DETR-based video grounding methods and our proposed PRVG. (a) DETR-VG. (b) Language DETR. (c) PRVG. DETR-VG first performs element-wise multiplication for multi-modal fusion, and then uses a fixed number of learnable moment queries to decode the temporal boundary of the language description. Language DETR injects text features into the moment queries, and uses the fused language-guided queries to decode the temporal boundary. In contrast, PRVG uses language as queries for parallel regression. By removing the classification branch, PRVG directly predicts the temporal boundary for each language description, which avoids complicated label assignment and post-processing. Moreover, PRVG accepts any number of language descriptions at once, and can therefore handle both sparse and dense video grounding. Limited by their fixed number of moment queries, the other two methods can only perform sparse video grounding, and need time-consuming beam search for dense video grounding.
  • ...and 1 more figure
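The Figure 4 and Figure 5 captions describe decoding with language as queries and video features as keys and values, followed by one boundary regression per sentence. A minimal sketch of that idea follows. It assumes a single attention head with no learned projections, and a hypothetical linear-plus-sigmoid head that outputs a normalized (center, width) pair. None of these details come from the paper, and the weight vectors and toy features are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: N sentence vectors; keys/values: T clip vectors.
    Returns N language-modulated visual vectors and the N x T weights.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        dim = len(values[0])
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)])
        weights.append(w)
    return outputs, weights

def regress_boundaries(features, w_center, w_width):
    """Hypothetical head: sigmoid(linear) gives normalized (center, width),
    converted to a (start, end) pair clipped to [0, 1]."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    spans = []
    for f in features:
        c = sigmoid(dot(f, w_center))
        w = sigmoid(dot(f, w_width))
        spans.append((max(0.0, c - w / 2), min(1.0, c + w / 2)))
    return spans

# Toy inputs: 4 video clips and 2 sentences, all with 2-d features.
clips = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
sentences = [[2.0, 0.0], [0.0, 2.0]]

attended, attn = cross_attend(sentences, clips, clips)
spans = regress_boundaries(attended, w_center=[-1.0, 1.0], w_width=[0.5, 0.5])
```

Every sentence yields exactly one span in a single pass, so the number of predictions follows the number of sentences in the paragraph. This is the property the Figure 5 caption contrasts with a fixed set of learnable moment queries.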