Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

Kun Xia, Le Wang, Sanping Zhou, Gang Hua, Wei Tang

TL;DR

The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes.

Abstract

The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating for learning from non-target classes, rather than focusing solely on the target class. The proposed approach partitions the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. To this end, we first devise strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.
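
To make the label-space partition concrete, the following is a minimal sketch in plain Python. It is not the paper's selection rule: the abstract only says that classes are selected by their confidence and rank relative to the target class, so the function name `partition_label_space` and all four thresholds (`pos_ratio`, `pos_max_rank`, `neg_ratio`, `neg_min_rank`) are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def partition_label_space(
    probs: Sequence[float],
    pos_ratio: float = 0.5,   # hypothetical: confidence >= pos_ratio * target confidence
    pos_max_rank: int = 3,    # hypothetical: and ranked within the top 3
    neg_ratio: float = 0.05,  # hypothetical: confidence <= neg_ratio * target confidence
    neg_min_rank: int = 5,    # hypothetical: and ranked 5th or lower
) -> Dict[str, List[int]]:
    """Split class indices into target / positive / negative / ambiguous sets."""
    # Rank classes by descending confidence; rank 1 is the target class.
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    target = order[0]
    p_tgt = probs[target]

    parts: Dict[str, List[int]] = {
        "target": [target],
        "positive": [],
        "negative": [],
        "ambiguous": [],
    }
    for rank, c in enumerate(order[1:], start=2):
        if probs[c] >= pos_ratio * p_tgt and rank <= pos_max_rank:
            # Confident and highly ranked relative to the target class.
            parts["positive"].append(c)
        elif probs[c] <= neg_ratio * p_tgt and rank >= neg_min_rank:
            # Very unlikely and lowly ranked relative to the target class.
            parts["negative"].append(c)
        else:
            # Neither clearly positive nor clearly negative: excluded from learning.
            parts["ambiguous"].append(c)
    return parts


example = partition_label_space([0.40, 0.30, 0.15, 0.10, 0.04, 0.01])
```

With these assumed thresholds, class 1 in the example is treated as positive, class 5 as negative, and classes 2 to 4 fall into the ambiguous set, so they would contribute to neither loss.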

Paper Structure (17 sections, 12 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of unreliable predictions on an unlabeled video snippet. A common practice is to treat the action class with the highest confidence, i.e., "Putting on Shoes", as its target class for model optimization, while the ground-truth label, i.e., "Sailing", is buried in the non-target classes.
  • Figure 2: An overview of our proposed Non-target Classes Learning framework. It follows the self-training paradigm, which iteratively uses the current model to assign pseudo labels to unlabeled videos and trains a new model on both the labeled videos and the pseudo-labeled videos. Given an unlabeled video snippet, the current model predicts a probability distribution of all classes. Our method adaptively partitions the label space $\Omega$ into a target class $\Omega^{tgt}$, positive classes $\Omega^{pos}$, negative classes $\Omega^{neg}$, and ambiguous classes $\Omega^{amb}$, by modeling both the confidence and rank of a class in relation to those of the target class. Based on the label space partition, we design the new positive learning loss $\ell_{pos}$ and negative learning loss $\ell_{neg}$ to mine positive and negative semantics that are absent in the target class, while excluding ambiguous classes.
  • Figure 3: Ablation study of SS-TAL results on THUMOS14 using I3D features and ActionFormer (Zhang et al., 2022), where the label ratio is 10% and $\star$ denotes using only labeled videos.
  • Figure 3: Effect of our method on the foreground-background subtask. We visualize the foreground and background features of an unlabeled THUMOS14 video.
  • Figure 4: Effect of our method on the foreground-instance subtask. We visualize the features of four challenging classes on THUMOS14.
  • ...and 2 more figures
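
The Figure 2 caption names a positive learning loss $\ell_{pos}$ and a negative learning loss $\ell_{neg}$ without giving their form. As a rough sketch only, one common way to push a prediction toward a set of classes is to minimize $-\log p_c$ over them, and to push it away from a set of classes is to minimize $-\log(1 - p_c)$. The functions below assume exactly that; the names `positive_loss` and `negative_loss`, the averaging over classes, and the `eps` stabilizer are illustrative assumptions, not the paper's definitions.

```python
import math
from typing import Sequence


def positive_loss(probs: Sequence[float], positive: Sequence[int], eps: float = 1e-8) -> float:
    """Assumed form: mean of -log(p_c) over positive classes (pulls p_c up)."""
    if not positive:
        return 0.0  # No positive classes selected for this snippet.
    return -sum(math.log(probs[c] + eps) for c in positive) / len(positive)


def negative_loss(probs: Sequence[float], negative: Sequence[int], eps: float = 1e-8) -> float:
    """Assumed form: mean of -log(1 - p_c) over negative classes (pushes p_c down)."""
    if not negative:
        return 0.0  # No negative classes selected for this snippet.
    return -sum(math.log(1.0 - probs[c] + eps) for c in negative) / len(negative)


# Hypothetical predicted distribution and class sets for one unlabeled snippet.
probs = [0.40, 0.30, 0.15, 0.10, 0.04, 0.01]
l_pos = positive_loss(probs, positive=[1])
l_neg = negative_loss(probs, negative=[5])
```

Ambiguous classes appear in neither index list, which is how this sketch mirrors the caption's point that they are excluded from both losses.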