TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie

TL;DR

This work introduces TR-DETR, a DETR-based framework for joint video moment retrieval (MR) and highlight detection (HD) guided by natural language queries. It leverages a local-global multi-modal alignment to reduce cross-modal gaps, a query-guided visual refinement to suppress irrelevant content, and a task cooperation module that propagates benefits between MR and HD via HD2MR and MR2HD pathways. Empirically, TR-DETR achieves state-of-the-art performance on QVHighlights, Charades-STA, and TVSum, with ablations demonstrating the importance of reciprocity-aware design and alignment regulators. The approach highlights the practical value of exploiting MR-HD reciprocity to improve multi-task video understanding and retrieval quality.

Abstract

Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain the relevant moments within a video and a highlight score for each video clip. Recently, several methods have built DETR-based networks to solve MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, and achieve good performance. Nevertheless, they underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement module is designed to eliminate query-irrelevant information from the visual features before modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at https://github.com/mingyao1120/TR-DETR.
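To make the two task outputs and their reciprocity concrete, the following toy sketch may help. It is not TR-DETR's HD2MR or MR2HD module, whose internals are not described in the abstract; the function names, the threshold rule, and the example scores are all hypothetical. It only illustrates that per-clip highlight scores can suggest a moment span, and that a moment span can in turn indicate which clips should score highly.

```python
from typing import List, Tuple


def scores_to_moment(scores: List[float], threshold: float) -> Tuple[int, int]:
    """Hypothetical HD -> MR hint: return the longest run of consecutive
    clips whose highlight score is >= threshold, as (start, end) clip
    indices with end exclusive. Returns (0, 0) if no clip qualifies."""
    best = (0, 0)
    start = None
    # A -inf sentinel closes a run that reaches the last clip.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def moment_to_mask(moment: Tuple[int, int], num_clips: int) -> List[float]:
    """Hypothetical MR -> HD hint: mark clips inside the retrieved moment
    with 1.0 and all other clips with 0.0."""
    start, end = moment
    return [1.0 if start <= i < end else 0.0 for i in range(num_clips)]


# Made-up highlight scores for an 8-clip video.
clip_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.2]
moment = scores_to_moment(clip_scores, threshold=0.5)
mask = moment_to_mask(moment, len(clip_scores))
```

In a learned model these hard rules would be replaced by differentiable components; the sketch only shows why each task's prediction carries information about the other.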

Paper Structure (22 sections, 14 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The proposed TR-DETR involves several key steps. Initially, two frozen pre-trained networks are employed to extract visual and textual features from videos and queries. Subsequently, a local-global multi-modal alignment module is constructed to effectively align the extracted visual and textual features. Then, the visual features are refined under the guidance of textual features for obtaining discriminative joint features. Finally, a task cooperation module is implemented to enhance prediction outcomes based on task reciprocity. Additionally, two multi-head self-attention components share weights.
  • Figure 2: Qualitative results of TR-DETR on the QVHighlights val set.
  • Figure 3: Qualitative results of TR-DETR on the TVSum val set.
  • Figure 4: The impact of the local-global alignment loss and $\lambda_{lg}$ on the QVHighlights val set, with audio features introduced.