DyFADet: Dynamic Feature Aggregation for Temporal Action Detection
Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li
TL;DR
The paper addresses the challenge of discriminative feature learning and head adaptation in Temporal Action Detection (TAD) by introducing Dynamic Feature Aggregation (DFA), which jointly adapts kernel weights and temporal receptive fields in a per-timestamp manner. DFA enables a Dynamic Encoder (DynE) and a Dynamic TAD Head (DyHead), forming the DyFADet framework that produces a discriminative, multi-scale feature pyramid for accurate action localization across varying durations. Across multiple benchmarks (including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-MQ1.0, and FineAction), DyFADet with DFA achieves state-of-the-art or competitive results, supported by extensive ablations and visualizations showing improved discriminability and dynamic head adaptation. The work provides a practical, code-released approach to dynamic temporal modeling, with potential to extend to broader video understanding tasks and improved efficiency via sparsity or further dynamic mechanisms.
Abstract
Recently proposed neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and in modeling action instances of various lengths from complex scenes with shared-weight detection heads. Inspired by the success of dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, DFA makes it possible to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances of diverse durations in videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released at https://github.com/yangle15/DyFADet-pytorch.
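
To make the idea of adapting kernel weights and receptive fields per timestamp concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation (see the released repository for that). It assumes a single-channel 1-D sequence, and the function name `dynamic_aggregate` and the gate parameters `a` and `b` are hypothetical names introduced only for illustration. At each timestamp, an input-dependent ReLU gate is computed for every kernel offset: a gate of zero drops that neighbor, which shrinks the receptive field, while a positive gate rescales the shared kernel weight.

```python
def dynamic_aggregate(x, w, a, b):
    """Toy per-timestamp dynamic aggregation (illustrative assumption only).

    x: input sequence (list of floats), one value per timestamp
    w: shared kernel weights, odd length k
    a, b: hypothetical gate parameters, one pair per kernel offset
    Returns (outputs, gates), where gates[t][j] is the gate at timestamp t
    for kernel offset j.
    """
    k = len(w)
    assert k % 2 == 1 and len(a) == k and len(b) == k
    half = k // 2
    outputs, gates = [], []
    for t in range(len(x)):
        # Gate depends on the feature at timestamp t, so it differs per timestamp.
        # ReLU: 0 removes the offset (receptive field), >0 rescales the weight.
        gate = [max(a[j] * x[t] + b[j], 0.0) for j in range(k)]
        acc = 0.0
        for j in range(k):
            s = t + j - half  # index of the neighbor for offset j
            if 0 <= s < len(x):  # out-of-range neighbors act as zero padding
                acc += w[j] * gate[j] * x[s]
        outputs.append(acc)
        gates.append(gate)
    return outputs, gates


# Example: positive inputs keep the left neighbor, negative inputs the right one.
outputs, gates = dynamic_aggregate(
    x=[1.0, -2.0, 3.0, 0.5],
    w=[0.25, 0.5, 0.25],
    a=[1.0, 0.0, -1.0],
    b=[0.0, 1.0, 0.0],
)
```

With `a` set to all zeros and `b` set to all ones, every gate equals one and the function reduces to an ordinary static convolution with zero padding, which is the shared-weight baseline that the abstract contrasts against.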
