Boundary-Recovering Network for Temporal Action Detection
Jihwan Kim, Jaehyun Choi, Yerim Jeon, Jae-Pil Heo
TL;DR
Temporal action detection must localize actions across varied durations in untrimmed videos, but a vanishing boundary problem (VBP) caused by pooling in coarse-to-fine feature pyramids degrades boundary cues and increases false positives. The authors propose Boundary-Recovering Network (BRN), introducing Scale-Time Representations by interpolating multi-scale features to a common temporal length and stacking them along a scale axis to form STF, plus Scale-Time Blocks (STB) that perform cross-scale exchange via scale and time dilated convolutions with a selection module. BRN is evaluated on THUMOS14 and ActivityNet-v1.3 using I3D features with FCOS and ActionFormer backbones, achieving state-of-the-art results and notably improving boundary localization for neighboring short instances. The key contributions include the explicit identification of VBP, the STF/STB framework for boundary recovery, and comprehensive ablations demonstrating the importance of cross-scale feature exchange for robust multi-scale TAD.
Abstract
Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the most primary difficulties in TAD. Naturally, multi-scale features have potential in localizing actions of diverse lengths as widely used in object detection. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not considered as a large one while short adjoining actions can be misunderstood as a long one. In the coarse-to-fine feature pyramid via pooling, these vague action boundaries can fade out, which we call 'vanishing boundary problem'. To this end, we propose Boundary-Recovering Network (BRN) to address the vanishing boundary problem. BRN constructs scale-time features by introducing a new axis called scale dimension by interpolating multi-scale features to the same temporal length. On top of scale-time features, scale-time blocks learn to exchange features across scale levels, which can effectively settle down the issue. Our extensive experiments demonstrate that our model outperforms the state-of-the-art on the two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with remarkably reduced degree of the vanishing boundary problem.
