UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma

TL;DR

This paper addresses the complementary tasks of Temporal Action Detection (TAD) and Moment Retrieval (MR) by proposing UniMD, a unified model that performs both tasks within a single moment-detection framework. It leverages CLIP-based text embeddings and two query-dependent heads (classification and regression) to produce uniform outputs for predefined actions and open-ended events, enabling open-vocabulary perception. The authors explore task fusion through pre-training and, more notably, synchronized co-training, demonstrating mutual improvements and data-efficient gains across the Ego4D, Charades, and ActivityNet benchmarks. The results establish state-of-the-art performance for both TAD and MR within a single model and highlight practical benefits for deploying unified video understanding systems.

Abstract

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different events, we observe that they are closely connected. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and uses two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.

Paper Structure (14 sections, 10 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Our proposed model, UniMD, can perform TAD and MR simultaneously. When co-trained on only a fraction of the training data (for example, 25% for MR and 50% for TAD), it can even outperform dedicated models.
  • Figure 2: The mutual benefits of the TAD and MR tasks. The queries in green belong to MR, and those in blue are TAD categories. The queries from MR help establish dependencies between actions, such as (a) co-occurrence and (b) order. The instances from TAD can (c) act as negative samples and (d) provide more events for MR.
  • Figure 3: Overview of UniMD. The network performs moment detection by treating each TAD category as an independent natural language query. The video features are fed into the vision encoder and BiFPN to extract multi-scale features. The text embeddings are then passed to the decoder, enabling the calculation of the foreground confidence for each time step and of the onset and offset of the actions. The classification head uses the textual embeddings as classifiers, and the regression head uses a transformation of the textual embeddings as a convolutional kernel.
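
The two query-dependent heads described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single feature scale (no BiFPN), a sigmoid over dot products for the classification head, and a kernel-size-1 convolution for the regression head. The names `classify`, `make_kernel`, `regress`, `decode`, and the projection `proj` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify(features, text_emb, temperature=1.0):
    """Foreground confidence per time step.

    The text embedding is used directly as the classifier weight
    (assumption: sigmoid of a scaled dot product).
    """
    return [1.0 / (1.0 + math.exp(-dot(f, text_emb) / temperature))
            for f in features]


def make_kernel(text_emb, proj):
    """Turn the text embedding into a (2 x D) regression kernel.

    `proj` is a hypothetical pair of D x D matrices, one for the
    onset distance and one for the offset distance.
    """
    return [[dot(row, text_emb) for row in mat] for mat in proj]


def regress(features, kernel):
    """Kernel-size-1 convolution over time (assumption).

    Returns non-negative distances (via ReLU) from each time step
    to the onset and offset of the queried moment.
    """
    return [[max(0.0, dot(k, f)) for k in kernel] for f in features]


def decode(features, text_emb, proj, threshold=0.5):
    """Combine both heads into (start, end, score) segments."""
    scores = classify(features, text_emb)
    offsets = regress(features, make_kernel(text_emb, proj))
    segments = []
    for t, (score, (d_start, d_end)) in enumerate(zip(scores, offsets)):
        if score >= threshold:
            segments.append((t - d_start, t + d_end, score))
    return segments


# Toy input: 3 time steps, 2-dimensional features, one query.
features = [[-1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
text_emb = [1.0, 0.0]
half_identity = [[0.5, 0.0], [0.0, 0.5]]
proj = [half_identity, half_identity]

segments = decode(features, text_emb, proj)
print(segments)
```

Because the same `decode` call works for any text embedding, a TAD category name and a free-form MR sentence would be handled identically in this sketch, which is the property the caption attributes to the unified design.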