Boundary-Recovering Network for Temporal Action Detection

Jihwan Kim, Jaehyun Choi, Yerim Jeon, Jae-Pil Heo

TL;DR

Temporal action detection must localize actions of widely varying durations in untrimmed videos. Pooling in the coarse-to-fine feature pyramid causes a vanishing boundary problem (VBP): boundary cues degrade and false positives increase. The authors propose the Boundary-Recovering Network (BRN), which builds scale-time features (STF) by interpolating multi-scale features to a common temporal length and stacking them along a scale axis, and adds Scale-Time Blocks (STB) that exchange features across scales via scale and time dilated convolutions combined by a selection module. BRN is evaluated on THUMOS14 and ActivityNet-v1.3 using I3D features with FCOS and ActionFormer backbones, achieving state-of-the-art results and notably improving boundary localization for neighboring short instances. The key contributions are the explicit identification of the VBP, the STF/STB framework for boundary recovery, and ablations demonstrating the importance of cross-scale feature exchange for robust multi-scale TAD.
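To make the vanishing boundary problem concrete, the following minimal sketch pools a toy per-frame "actionness" sequence and counts the action segments before and after. The sequence values, the use of max pooling with kernel and stride 2, and the helper names `max_pool1d` and `count_segments` are all assumptions chosen for illustration; the summary above only states that pooling is involved.

```python
def max_pool1d(x, kernel=2, stride=2):
    """Non-overlapping 1D max pooling over a list of per-frame scores.

    Max pooling is an assumption here; the source only says "pooling".
    """
    return [max(x[i:i + kernel]) for i in range(0, len(x) - kernel + 1, stride)]


def count_segments(x, threshold=0.5):
    """Count contiguous runs of frames whose score exceeds the threshold."""
    segments, inside = 0, False
    for value in x:
        active = value > threshold
        if active and not inside:
            segments += 1
        inside = active
    return segments


# Hypothetical toy input: two short actions separated by one background frame.
fine = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
coarse = max_pool1d(fine)

print(coarse)                  # [0, 1, 1, 1, 1, 0]
print(count_segments(fine))    # 2
print(count_segments(coarse))  # 1
```

At the finer level the single background frame separates two instances, while at the coarser level that frame is absorbed and the two short actions read as one long segment, which is the kind of long false positive the summary describes.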

Abstract

Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the primary difficulties in TAD. Naturally, multi-scale features, as widely used in object detection, have potential for localizing actions of diverse lengths. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not mistaken for a large one, whereas short adjoining actions can be mistaken for a long one. In the coarse-to-fine feature pyramid built via pooling, these vague action boundaries can fade out, which we call the 'vanishing boundary problem'. To address it, we propose the Boundary-Recovering Network (BRN). BRN constructs scale-time features by interpolating multi-scale features to the same temporal length and arranging them along a new axis, the scale dimension. On top of the scale-time features, scale-time blocks learn to exchange features across scale levels, which effectively mitigates the problem. Our extensive experiments demonstrate that our model outperforms the state of the art on two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with a markedly reduced degree of the vanishing boundary problem.
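The construction of scale-time features described in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it uses linear interpolation (the source only says "interpolating"), a single feature channel per time step, and hypothetical helper names `interpolate_linear` and `build_scale_time`.

```python
def interpolate_linear(x, out_len):
    """Linearly resample a 1D sequence to out_len samples, aligning endpoints.

    Linear interpolation is an assumption; the source does not name the mode.
    """
    if len(x) == 1 or out_len == 1:
        return [float(x[0])] * out_len
    step = (len(x) - 1) / (out_len - 1)
    out = []
    for t in range(out_len):
        pos = t * step
        lo = min(int(pos), len(x) - 1)   # left neighbor index
        hi = min(lo + 1, len(x) - 1)     # right neighbor index (clamped)
        frac = pos - lo
        out.append((1 - frac) * x[lo] + frac * x[hi])
    return out


def build_scale_time(pyramid):
    """Resample every pyramid level to the finest length and stack them.

    Returns a list of shape [num_scales][time]; the outer index plays the
    role of the scale axis.
    """
    target_len = len(pyramid[0])
    return [interpolate_linear(level, target_len) for level in pyramid]


# Hypothetical three-level pyramid with temporal lengths 8, 4, and 2.
pyramid = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [1, 3, 5, 7],
    [2, 6],
]
stf = build_scale_time(pyramid)
print(len(stf), [len(row) for row in stf])  # 3 [8, 8, 8]
```

Once every level shares the same temporal length, a position on the time axis refers to the same moment at every scale, so an operation that slides along the scale axis can compare fine and coarse evidence for that moment.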

Paper Structure (14 sections, 9 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Illustration of the vanishing boundary problem. In this intuitive example, three action instances pass through a coarse-to-fine pyramid network with pooling operations. Features of background frames easily fade out under pooling because they lack clear action-related patterns. However, this naive pooling introduces the vanishing boundary problem when background lies between short action instances; such frames are labeled as important background, like points 'A' and 'B'. As a result, the model can predict long false positives at coarser levels due to the temporal ambiguity of action boundaries.
  • Figure 2: Overall architecture of Boundary-Recovering Network (BRN). First, the features for a video from a pre-trained 3D CNN are fed into the backbone network to construct multi-scale features. Second, simple interpolation builds scale-time features. Finally, the scale-time blocks learn to exchange features over scales to recover the boundary patterns.
  • Figure 3: Scale-Time Blocks. Scale and time sub-blocks have dilated convolutions with different rates and kernel sizes. Afterwards, outputs $O_i$ are aggregated with a selection module, an attention-based pooling as in Eqs. \ref{eq:STB_GAP}, \ref{eq:STB_ATT}, \ref{eq:STB_Out}.
  • Figure 4: Examples of selection weights. The figure illustrates two examples of the selection weights of the final scale convolution block on test samples from ActivityNet-v1.3.
  • Figure 5: Visualization related to the vanishing boundary problem. The figure shows samples from the validation set of ActivityNet-v1.3. FCOS produces longer false positives for neighboring short instances due to the vanishing boundary problem, whereas our model with the FCOS backbone localizes them precisely.
  • ...and 2 more figures
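
The Figure 3 caption describes the selection module only as an attention-based pooling over the sub-block outputs $O_i$. One plausible reading, sketched below, pools each output over time, turns the pooled values into softmax weights, and returns the weighted sum. The equations themselves are not reproduced on this page, so this is an assumption: any learned layers between pooling and softmax are omitted, each output is a single-channel sequence, and the names `softmax` and `select` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select(branches):
    """Aggregate equally long branch outputs with attention-style weights.

    Assumed steps (not confirmed by the source):
      1. global average pooling of each branch over time,
      2. softmax across branches to obtain selection weights,
      3. weighted sum of the branch outputs at every time step.
    """
    pooled = [sum(branch) / len(branch) for branch in branches]
    weights = softmax(pooled)
    length = len(branches[0])
    output = [
        sum(w * branch[t] for w, branch in zip(weights, branches))
        for t in range(length)
    ]
    return output, weights


# Hypothetical outputs of two sub-blocks with different dilation rates.
o_1 = [1.0, 1.0, 1.0, 1.0]
o_2 = [0.0, 0.0, 0.0, 0.0]
output, weights = select([o_1, o_2])
print(weights)  # the first branch receives the larger weight
```

In this simplified form the weights depend only on the mean activation of each branch; a trained module would presumably learn the mapping from pooled statistics to weights, which is what would let the selection weights shown in Figure 4 vary across samples.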