Dual DETRs for Multi-Label Temporal Action Detection

Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang

TL;DR

DualDETR addresses multi-label Temporal Action Detection by introducing dual-level queries for instance- and boundary-level reasoning within a two-branch decoding framework. A joint initialization aligns both query types with the same encoder proposals, and a mutual refinement module lets the two levels exchange complementary information, yielding accurate boundary localization without NMS. The method achieves leading det-mAP on MultiTHUMOS, Charades, and TSU while maintaining competitive seg-mAP and efficient convergence, demonstrating strong performance in densely overlapping action scenarios. By integrating boundary-aware and content-aware reasoning, DualDETR advances boundary precision and recognition in complex, multi-label video understanding tasks.

Abstract

Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new dual-level query-based TAD framework, namely DualDETR, to detect actions at both the instance level and the boundary level. Decoding at different levels requires semantics of different granularity; therefore, we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content priors from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR over existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
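
To make the distinction between the two levels concrete, the following minimal sketch converts between an instance-level description of an action (center point and duration) and a boundary-level description (start and end). The function names are hypothetical and chosen only for illustration; this is not taken from the DualDETR implementation.

```python
from typing import Tuple


def instance_to_boundary(center: float, duration: float) -> Tuple[float, float]:
    """Convert an instance-level (center, duration) pair to (start, end)."""
    half = duration / 2.0
    return center - half, center + half


def boundary_to_instance(start: float, end: float) -> Tuple[float, float]:
    """Convert a boundary-level (start, end) pair to (center, duration)."""
    return (start + end) / 2.0, end - start


# An action centered at t=5.0 lasting 2.0 time units.
start, end = instance_to_boundary(5.0, 2.0)
center, duration = boundary_to_instance(start, end)
```

In the instance-level form, each boundary depends on both the center and the duration, so an error in either one shifts the boundary. This gives one intuition for why predicting the start and end directly may help boundary localization.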

Paper Structure (18 sections, 12 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: DualDETR operates at both the instance level and boundary level (start, end) by using two groups of queries, with each group corresponding to one level. To capture specific semantics at each level, we introduce a two-branch decoding structure. This structure separates the decoding process for each level, allowing queries from each group to focus on their corresponding encoder feature map. Furthermore, we propose a query alignment strategy equipped with joint initialization. This strategy aligns the queries from two groups by matching them with the same detection goal, as denoted by the bidirectional arrow.
  • Figure 2: Pipeline of DualDETR. The pre-extracted video features, augmented with the positional embedding, pass through a transformer encoder to produce the encoder feature map. This map is divided along the channel dimension into separate feature maps for the boundary-level (start, end) and instance-level modeling, respectively. An auxiliary dense detection head is applied to generate encoder proposals and scores. Upon this, decoder queries are constructed using the query alignment strategy. The decoding process is performed at dual levels. Thanks to the query alignment, dual-level queries can perform a complementary refinement through the mutual refinement module. Finally, DualDETR directly outputs action instance predictions without NMS post-processing.
  • Figure 3: Query Alignment with Joint Initialization. (a) Instance queries and boundary queries are aligned to match with the encoder predictions in a one-to-one manner. (b) The matched encoder prediction serves as the initialization for dual-level queries.
  • Figure 4: Convergence curves of DualDETR, PointTAD, and ActionFormer on MultiTHUMOS.
  • Figure 5: Comparison of detection mAP at each decoder layer for different query initializations. All initialization strategies are re-implemented in the DualDETR framework. The joint initialization shows strong detection performance from early decoding stages and continues to outperform other initialization variants as the number of decoder layers increases.
  • ...and 2 more figures
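
The query alignment with joint initialization described in Figures 1 to 3 can be illustrated with a small sketch. It assumes that encoder proposals are (start, end, score) triples and that the highest-scoring proposals are the ones matched to queries; both the data layout and the top-k selection rule are assumptions made for illustration, and the content prior is omitted. The name `joint_initialize` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical proposal layout: (start, end, score).
Proposal = Tuple[float, float, float]


def joint_initialize(
    proposals: List[Proposal], num_queries: int
) -> Tuple[List[Tuple[float, float]], List[Tuple[float, float]]]:
    """Initialize instance and boundary queries from the same proposals.

    Query k in both groups is derived from the same encoder proposal, so
    the two groups stay aligned to one detection goal.
    """
    # Assumption: keep the top-scoring proposals, one per query pair.
    ranked = sorted(proposals, key=lambda p: p[2], reverse=True)[:num_queries]

    instance_queries = []  # position prior as (center, duration)
    boundary_queries = []  # position prior as (start, end)
    for start, end, _score in ranked:
        instance_queries.append(((start + end) / 2.0, end - start))
        boundary_queries.append((start, end))
    return instance_queries, boundary_queries


proposals = [(1.0, 3.0, 0.2), (4.0, 8.0, 0.9), (0.0, 1.0, 0.5)]
instance_q, boundary_q = joint_initialize(proposals, num_queries=2)
```

Because both lists are built in the same loop, index k always refers to the same proposal in each group. That shared index is what would let a later refinement step pair an instance query with its boundary counterpart.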