Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Jia-Run Du, Jia-Chang Feng, Kun-Yu Lin, Fa-Ting Hong, Xiao-Ming Wu, Zhongang Qi, Ying Shan, Wei-Shi Zheng

TL;DR

This work addresses Weakly-Supervised Temporal Action Localization from a novel category exclusion perspective, gradually enhancing snippet-level supervision to bridge the gap between the available video-level supervision and the unavailable snippet-level supervision.

Abstract

Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels. Because no snippet-level supervision is available to indicate action boundaries, previous methods typically assign pseudo labels to unlabeled snippets. However, since some action instances of different categories are visually similar, it is non-trivial to assign exactly the (usually) one correct action category to a snippet, and incorrect pseudo labels impair localization performance. To address this problem, we propose a novel method from a category exclusion perspective, named Progressive Complementary Learning (ProCL), which gradually enhances the snippet-level supervision. Our method is inspired by the fact that video-level labels precisely indicate the categories that all snippets surely do not belong to, a fact ignored by previous work. Accordingly, we first exclude these surely non-existent categories with a complementary learning loss. Then, we introduce background-aware pseudo complementary labeling to exclude more categories for the less ambiguous snippets. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on two popular benchmarks, namely THUMOS14 and ActivityNet1.3.
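
The first step in the abstract, excluding categories that the video-level label rules out for every snippet, can be illustrated with a generic complementary-label loss. The sketch below is an assumption, not the paper's exact formulation: the function name `complementary_loss`, the per-class probability input, and the form of the loss (the negative log of one minus the predicted probability, summed over absent categories and averaged over snippets) are all chosen here for illustration.

```python
import math


def complementary_loss(probs, video_labels, eps=1e-8):
    """Illustrative complementary-label loss (assumed form, not the paper's).

    probs: list of T snippets, each a list of C class probabilities in [0, 1].
    video_labels: list of C ints, 1 if the category appears in the video.
    Returns the mean over snippets of -sum_{c absent} log(1 - p_c).
    """
    # Categories absent from the video-level label are excluded for all snippets.
    absent = [c for c, y in enumerate(video_labels) if y == 0]
    if not probs or not absent:
        return 0.0
    total = 0.0
    for snippet in probs:
        for c in absent:
            # Penalize any probability mass placed on an excluded category.
            total -= math.log(max(1.0 - snippet[c], eps))
    return total / len(probs)


# Hypothetical video labeled with category 0 only, out of 3 categories.
labels = [1, 0, 0]
clean = [[0.9, 0.0, 0.0], [0.2, 0.0, 0.0]]   # no mass on excluded categories
leaky = [[0.9, 0.5, 0.0], [0.2, 0.0, 0.5]]   # some mass on excluded categories
loss_clean = complementary_loss(clean, labels)
loss_leaky = complementary_loss(leaky, labels)
```

Under this assumed form, the loss is zero only when every snippet assigns no probability to the excluded categories, and it grows as mass leaks onto them, which is the behavior the category exclusion idea calls for.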

Paper Structure (15 sections, 11 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of complementary categories. Given a video containing background and two action categories, i.e., "CricketShot" and "CricketBowling", it is clear that all snippets in the video surely do not belong to the categories "HighJump", "Diving", etc., which are termed deterministic complementary categories. Furthermore, for some snippets (red boxes), we would confidently exclude some categories that these snippets likely do not belong to, which are termed pseudo complementary categories (e.g., in the right red box, we easily exclude the category "CricketBowling" according to the visible cricket bat).
  • Figure 2: Overview of our proposed Progressive Complementary Learning (ProCL). Given a video, the class activation sequence $\mathcal{S}$ and the class-agnostic attention scores $\mathcal{A}$ are generated by the classification and the attention heads. ProCL progressively excludes categories that snippets should not belong to, for gradually enhancing snippet-level supervision. First, according to the video-level labels, we exclude the categories that all snippets surely do not belong to (i.e., deterministic complementary categories) by the Complementary Learning loss $L_{CL}$. Then, for snippets of less ambiguity, we exclude some categories that the snippets likely do not belong to (i.e., pseudo complementary categories) by the Multi-scale Pseudo Complementary Learning loss $L_{MPCL}$. Furthermore, for the remaining ambiguous snippets, we disambiguate them by the Foreground Background Discrimination loss $L_{FBD}$, which aims to coarsely provide further snippet-level supervision. Besides, the Multiple-Instance Learning loss $L_{MIL}$ is adopted for video-level supervision.
  • Figure 3: Visualization of ablation studies, where "+" indicates that a new component is added on top of the previous experiment. Red dashed boxes mark areas of significant improvement.
  • Figure 4: The label precision of different labeling methods, where PL denotes the pseudo labels obtained by following the previous pseudo-label-based method (he2022asm), and PCL and MPCL denote our pseudo complementary labels obtained from a single-scale snippet sequence and from multi-scale snippet sequences, respectively.
  • Figure 5: Visualization of the category prediction scores of our method and of pseudo labeling. This figure shows a video containing only "ThrowDiscus", and each row indicates the category prediction scores of the snippets. The red dashed boxes mark the regions where the pseudo-label-based method misclassifies snippets but our method still classifies them correctly.