Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer

Ziyi Liu, Yangcen Liu

TL;DR

The paper tackles weakly-supervised temporal action localization (WTAL) by bridging it to fully-supervised TAL with PseudoFormer, a two-branch framework that combines a weak branch trained with multi-instance learning (MIL) and a regression-based full branch. It introduces RickerFusion to fuse weak-branch outputs into high-quality pseudo labels, employs an uncertainty mask to mitigate label noise during training, and uses a teacher-student exponential moving average (EMA) regime to refine learning. The approach leverages both snippet-level and proposal-level priors, achieving state-of-the-art WTAL performance on THUMOS14 and ActivityNet1.3, and even approaching fully-supervised performance on some benchmarks. This work advances practical temporal action localization under varying supervision levels and suggests a path toward unified frameworks that exploit priors across supervision regimes.
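The teacher-student EMA regime mentioned above can be illustrated with a minimal sketch. It assumes model weights are stored as a plain dictionary of floats; the function name `ema_update` and the `decay` value are illustrative choices, not details taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to floats; a real model would hold tensors instead.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward the student.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes slowly, the pseudo labels it produces tend to be more stable than those from the rapidly changing student, which is the usual motivation for this kind of scheme.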

Abstract

Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. To address these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate higher-quality pseudo labels. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on two commonly used benchmarks, THUMOS14 and ActivityNet1.3. In addition, extensive ablation studies demonstrate the contribution of each component of our method.
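The idea of mapping proposals into a shared space can be sketched as follows. This is only an illustration under stated assumptions, not the paper's RickerFusion formulation: each proposal `(start, end, score)` is assumed to contribute a score-weighted Ricker (Mexican-hat) wavelet centered at its midpoint, with the wavelet's zero crossings placed at the proposal boundaries, and pseudo segments are read off wherever the summed response is positive. The function names, the width choice, and the zero threshold are all hypothetical.

```python
import math


def ricker(t, sigma):
    """Unnormalized Ricker wavelet: peak 1 at t = 0, zero at |t| = sigma."""
    x = t / sigma
    return (1.0 - x * x) * math.exp(-0.5 * x * x)


def fuse_proposals(proposals, num_snippets):
    """Sum score-weighted wavelets from all proposals on one shared timeline."""
    fused = [0.0] * num_snippets
    for start, end, score in proposals:
        center = 0.5 * (start + end)
        # Assumption: half the duration as width, so the response turns
        # negative just outside the proposal boundaries.
        sigma = max(0.5 * (end - start), 1e-6)
        for t in range(num_snippets):
            fused[t] += score * ricker(t - center, sigma)
    return fused


def extract_segments(fused, threshold=0.0):
    """Return [start, end) index runs where the fused response exceeds threshold."""
    segments, run_start = [], None
    for t, value in enumerate(fused):
        if value > threshold and run_start is None:
            run_start = t
        elif value <= threshold and run_start is not None:
            segments.append((run_start, t))
            run_start = None
    if run_start is not None:
        segments.append((run_start, len(fused)))
    return segments


# Two overlapping proposals and one isolated proposal (toy values).
proposals = [(10, 20, 0.9), (12, 22, 0.8), (40, 50, 0.7)]
fused = fuse_proposals(proposals, num_snippets=60)
pseudo_segments = extract_segments(fused)
```

In this toy setup the two overlapping proposals reinforce each other inside their common region and suppress each other near the disagreeing boundaries, so they collapse into a single pseudo segment.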

Paper Structure

This paper contains 12 sections, 11 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Our paper addresses three primary questions: (a) How to improve the quality of labels generated by the base model? (b) What priors could the regression model learn from, and how could it better use them? (c) How to train with noisy labels under uncertainty?
  • Figure 2: Overall framework of PseudoFormer. (a) Weak Branch: After feature extraction, the base model predicts agnostic attention and classification scores. Using Multi-instance Learning (MIL) with video-level labels, it outputs proposals and the snippet-level predictions (SPs). (b) Full Branch: A regression-based model is trained on the snippet-level predictions (SPs), pseudo proposals, and uncertainty mask, and is used for final inference. (c) RickerFusion: To improve the quality of pseudo labels, RickerFusion maps predictions across perception scales into a shared space, producing better pseudo labels by fusing predictions from the weak branch. (d) Mask Generation: To train with noisy pseudo labels, an uncertainty mask is applied to proposal boundaries, and the uncertain regions gradually shrink after the warm-up epoch. The pseudo proposals and the uncertainty mask are iteratively refined during training.
  • Figure 3: Visualization on a test video of the ground truth, the base model (DELU), the result after RickerFusion, and PseudoFormer. For the base model and PseudoFormer, we visualize the top-4 proposals overlapping the ground truth. For this input video, PseudoFormer produces more consistent and accurate predictions.
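
The mask generation described in Figure 2(d) can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: snippets within a margin of each pseudo-proposal boundary are marked uncertain (mask value 0, so their loss would be ignored), and the margin is a fraction of the proposal length that stays constant during warm-up and then decays linearly. The function names, the linear schedule, and all default values are hypothetical.

```python
def uncertain_ratio(epoch, warmup_epochs=5, total_epochs=30, start_ratio=0.2):
    """Hypothetical schedule: constant during warm-up, then linear decay to 0."""
    if epoch < warmup_epochs:
        return start_ratio
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return start_ratio * max(0.0, 1.0 - progress)


def uncertainty_mask(segments, num_snippets, ratio):
    """Return a 0/1 list; 0 marks snippets close to a pseudo-proposal boundary."""
    mask = [1] * num_snippets
    for start, end in segments:
        margin = ratio * (end - start)  # uncertain band scales with duration
        for t in range(num_snippets):
            if abs(t - start) < margin or abs(t - end) < margin:
                mask[t] = 0
    return mask


# One pseudo proposal on a 30-snippet timeline (toy values).
segments = [(10, 20)]
early = uncertainty_mask(segments, 30, uncertain_ratio(epoch=0))
late = uncertainty_mask(segments, 30, uncertain_ratio(epoch=30))
```

Early in training, the bands around snippets 10 and 20 are excluded; by the final epoch of this toy schedule the ratio reaches zero and every snippet contributes to the loss.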