Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

TL;DR

This work tackles Weakly-supervised Temporal Action Localization (WSTAL) by addressing the gap between classification-driven pseudo labels and the localization goal. It introduces FuSTAL, a full-stage framework that enhances pseudo label quality across the Generation-, Selection-, and Training-Stages via cross-video contrastive proposal generation, prior-based filtering, and EMA-based distillation, yielding smoother, more complete action proposals. FuSTAL reaches a state-of-the-art average mAP of 50.8% on THUMOS'14, making it the first method to reach the 50% milestone, and 28.4% on ActivityNet v1.3. These results indicate that improving pseudo label quality at every stage is a practical route to robust weakly-supervised action localization.

Abstract

Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Recent WSTAL methods introduce a pseudo label learning framework to bridge the gap between classification-based training and the localization target at inference, and achieve cutting-edge results. In these frameworks, a classification-based model generates pseudo labels for a regression-based student model to learn from. However, the quality of these pseudo labels, a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet efficient pseudo label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo label quality at three stages: cross-video contrastive learning at the proposal Generation-Stage, prior-based filtering at the proposal Selection-Stage, and EMA-based distillation at the Training-Stage. Together, these designs help produce action proposals that are more informative, contain fewer false positives, and are smoother. With these designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2%, and becomes the first method to reach the milestone of 50%.
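
To make the EMA-based distillation idea concrete, the sketch below shows a generic exponential-moving-average update of model parameters toward a student model. It is an illustration under stated assumptions, not the authors' implementation: the function name `ema_update`, the dictionary-of-scalars parameter format, and the `decay` values are hypothetical, and the abstract does not specify the decay or the update schedule.

```python
def ema_update(ema_params, student_params, decay=0.999):
    """Return new EMA parameters after one update step.

    `decay` is a hypothetical smoothing factor; the paper's value is not
    given in the abstract. Parameters are plain floats keyed by name here,
    standing in for real model tensors.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: decay * ema_params[name] + (1.0 - decay) * student_params[name]
        for name in ema_params
    }


# Toy usage: the EMA copy drifts smoothly toward a fixed student value.
ema = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    ema = ema_update(ema, student, decay=0.5)
print(ema["w"])  # 0.875 after three steps with decay 0.5
```

Because the EMA copy averages over many student updates, its predictions change more slowly than the student's, which is one plausible reason such a model could serve as a steadier pseudo label generator later in training.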

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Unlike the classical two-stage pseudo label learning framework for Weakly-supervised Temporal Action Localization (WSTAL), we identify additional stages in the framework and enhance pseudo label quality at every stage.
  • Figure 2: Overview of FuSTAL: (a) Generation-Stage: An in-video and a cross-video contrastive loss are applied to mined snippets to help extract the essential characteristics of each action, thus generating more informative action proposals. Triangles and circles in different colors and patterns refer to hard and easy embeddings from different videos. (b) Selection-Stage: The initial proposals are gathered to calculate an IoU score for each. Only proposals with scores above the thresholds are kept as pseudo labels. (c) Training-Stage: A regression-based student model is trained on the selected action proposals in a supervised manner. Meanwhile, an EMA model is updated and takes over as the new label generator once the original proposals reach their ceiling. Only the trained regression-based model is used for inference.
  • Figure 3: In the "Soccer Penalty" video, background snippets after the kick (blue box) are similar to the preceding penalty-kicking action. With another video as reference, this similarity is strongly suppressed, which exposes the essential characteristics of the action: run-ups and kicks.
  • Figure 4: The proposals around ground-truth action segments tend to be denser than those around backgrounds.
  • Figure 5: Qualitative comparison with ground-truth and baseline method on a 'Clean and Jerk' video. FuSTAL produces more continuous and accurate action snippets.
  • ...and 1 more figure
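
The Selection-Stage described in Figure 2(b), together with the density observation in Figure 4, can be illustrated with a small sketch. It assumes that a proposal's score is its mean temporal IoU with all other proposals from the same video, so that proposals in dense clusters score high and isolated ones score low. That scoring rule, the function names `temporal_iou` and `select_proposals`, and the threshold value are assumptions made for illustration; the captions do not give the exact formula.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_proposals(proposals, threshold):
    """Keep proposals whose mean IoU with the other proposals exceeds threshold.

    Assumed scoring rule (not taken from the paper): a proposal surrounded by
    many overlapping proposals gets a high score, an isolated one gets a low score.
    """
    kept = []
    for i, p in enumerate(proposals):
        others = [q for j, q in enumerate(proposals) if j != i]
        if not others:
            continue  # a single proposal has no neighbours to score against
        score = sum(temporal_iou(p, q) for q in others) / len(others)
        if score > threshold:
            kept.append(p)
    return kept


# Toy usage: three overlapping proposals and one isolated proposal.
proposals = [(1.0, 3.0), (1.2, 3.1), (0.9, 2.8), (10.0, 11.0)]
print(select_proposals(proposals, threshold=0.3))  # the isolated proposal is dropped
```

In this toy case the three clustered segments support one another and pass the threshold, while the isolated segment overlaps nothing and is filtered out.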