DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li

TL;DR

The paper addresses the challenge of discriminative feature learning and head adaptation in Temporal Action Detection (TAD) by introducing Dynamic Feature Aggregation (DFA), which jointly adapts kernel weights and temporal receptive fields in a per-timestamp manner. DFA enables a Dynamic Encoder (DynE) and a Dynamic TAD Head (DyHead), forming the DyFADet framework that produces a discriminative, multi-scale feature pyramid for accurate action localization across varying durations. Across multiple benchmarks (including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-MQ1.0, and FineAction), DyFADet with DFA achieves state-of-the-art or competitive results, supported by extensive ablations and visualizations showing improved discriminability and dynamic head adaptation. The work provides a practical, code-released approach to dynamic temporal modeling, with potential to extend to broader video understanding tasks and improved efficiency via sparsity or further dynamic mechanisms.

Abstract

Recently proposed neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and in modeling action instances of various lengths from complex scenes, because they rely on shared-weight detection heads. Inspired by the success of dynamic neural networks, we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released at https://github.com/yangle15/DyFADet-pytorch.

Paper Structure (41 sections, 9 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Differences between convolution and DFA. (a) A normal convolution with static weights and receptive fields. (b) The dynamic formations of DFA at different timestamps. (c) Implementing DFA to build DyFADet can address the two issues in TAD.
  • Figure 2: An illustration of the proposed DFA module. (a) A normal convolution with the kernel size of 3 (Conv 3) and its corresponding formation realized by a shifting module and a point-wise convolution. (b) A DFA. The shifted representations multiplied with the weighted mask will be sent to the point-wise convolution. The DFA module is equivalent to a convolution with adaptive kernel weights and receptive fields.
  • Figure 3: Two different formations of the proposed DFA module.
  • Figure 4: (a) Overview of DyFADet. (b) The DynE layer consisting of the feature encoder. GN is Group Normalization (Wu and He, 2018). (c) The multi-scale feature fusion in DyHead. (d) The classification and regression module obtains the classification and boundary results.
  • Figure 5: (a) The average cosine similarity between features at different timestamps in the same level among each encoder layer. (b) Similarity matrix of the extracted features among timestamps.
  • ...and 3 more figures
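
The Figure 2 caption describes a kernel-size-3 convolution as a shifting module followed by a point-wise convolution, and DFA as the same pipeline with a weighted mask applied to the shifted representations. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The function names (`shift`, `conv1d`, `shift_pointwise`), the tensor layout, and the hand-written `mask` are all assumptions made for illustration; in the paper the mask is produced by the model, and how it is generated is not covered here.

```python
# Illustrative sketch only. Names, layouts, and the mask values below are
# assumptions for this example, not taken from the DyFADet code base.

def shift(x, offset):
    """Shift T feature vectors along time; out-of-range positions become zeros.

    x is a list of T vectors (each a list of C floats).
    Output position t holds x[t + offset].
    """
    T, C = len(x), len(x[0])
    return [x[t + offset] if 0 <= t + offset < T else [0.0] * C
            for t in range(T)]


def conv1d(x, weight):
    """Plain zero-padded temporal convolution with static weights.

    weight[k][o][i]: k indexes the temporal offset k - K // 2,
    o the output channel, i the input channel.
    """
    T, C_in = len(x), len(x[0])
    K, C_out = len(weight), len(weight[0])
    out = []
    for t in range(T):
        row = [0.0] * C_out
        for k in range(K):
            s = t + k - K // 2
            if 0 <= s < T:
                for o in range(C_out):
                    for i in range(C_in):
                        row[o] += weight[k][o][i] * x[s][i]
        out.append(row)
    return out


def shift_pointwise(x, weight, mask=None):
    """Shift module + point-wise convolution, optionally masked per timestamp.

    mask[t][k] scales the k-th shifted copy at timestamp t. With mask=None
    every entry is 1 and the result matches conv1d. A zero entry removes
    that offset from the receptive field at t; a fractional entry reweights it.
    """
    T, C_in = len(x), len(x[0])
    K, C_out = len(weight), len(weight[0])
    shifted = [shift(x, k - K // 2) for k in range(K)]
    out = []
    for t in range(T):
        row = [0.0] * C_out
        for k in range(K):
            m = 1.0 if mask is None else mask[t][k]
            for o in range(C_out):
                row[o] += m * sum(weight[k][o][i] * shifted[k][t][i]
                                  for i in range(C_in))
        out.append(row)
    return out


# Toy input: T = 4 timestamps, 2 input channels.
x = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0], [2.0, 2.0]]
# Kernel size 3, 2 output channels, 2 input channels.
w = [
    [[1.0, 0.0], [0.0, 1.0]],    # offset -1
    [[2.0, 1.0], [-1.0, 0.0]],   # offset  0
    [[0.0, 1.0], [1.0, 1.0]],    # offset +1
]
# Hypothetical per-timestamp mask: each row changes the effective kernel.
mask = [
    [0.0, 1.0, 0.0],   # t=0: only the current frame (receptive field 1)
    [1.0, 1.0, 1.0],   # t=1: behaves like the static convolution
    [0.5, 1.0, 0.0],   # t=2: down-weights the past, ignores the future
    [1.0, 1.0, 0.0],   # t=3: causal window
]

static_out = conv1d(x, w)
unmasked_out = shift_pointwise(x, w)
dynamic_out = shift_pointwise(x, w, mask)
```

In this sketch the mask is the only thing that varies across timestamps, yet it changes both which neighbors contribute (the receptive field) and how strongly each contributes (the effective kernel weights). That matches the equivalence stated in the caption, under the assumptions listed above.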