Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
Ziyi Liu, Yangcen Liu
TL;DR
The paper tackles weakly-supervised temporal action localization (WTAL) by bridging it to fully-supervised TAL with PseudoFormer, a two-branch framework that pairs a weak branch based on multiple-instance learning (MIL) with a regression-based full branch. It introduces RickerFusion to fuse weak-branch outputs into high-quality pseudo labels, applies an uncertainty mask to mitigate label noise during training, and uses a teacher-student exponential moving average (EMA) scheme to refine learning. The approach leverages both snippet-level and proposal-level priors, achieving state-of-the-art WTAL performance on THUMOS14 and ActivityNet1.3 and approaching fully-supervised performance on some benchmarks. The work advances practical temporal action localization under varying supervision levels and suggests a path toward unified frameworks that exploit priors across supervision regimes.
Abstract
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a gap in both performance and framework design compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. Motivated by these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate higher-quality pseudo labels. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on two commonly used benchmarks, THUMOS14 and ActivityNet1.3. In addition, extensive ablation studies demonstrate the contribution of each component of our method.
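To make the idea of mapping proposals to a shared space concrete, the following is a minimal sketch and not the authors' implementation. It assumes each weak-branch proposal is a (start, end, score) triple in snippet units, represents it as a score-weighted Ricker (Mexican-hat) wavelet whose zero crossings sit on the proposal boundaries, sums the wavelets on one timeline, and keeps the positive regions as pseudo-label segments. The function names, the width choice, and the zero threshold are all illustrative assumptions.

```python
import numpy as np

def ricker(t, center, width):
    """Ricker (Mexican-hat) wavelet centered at `center` with scale `width`."""
    z = (t - center) / width
    return (1.0 - z**2) * np.exp(-0.5 * z**2)

def fuse_proposals(proposals, num_snippets):
    """Sum score-weighted wavelets from (start, end, score) proposals on one timeline."""
    t = np.arange(num_snippets, dtype=float)
    curve = np.zeros(num_snippets)
    for start, end, score in proposals:
        center = 0.5 * (start + end)
        # Assumed width: half the proposal length, so the wavelet is
        # positive inside the proposal and negative just outside it.
        width = max(0.5 * (end - start), 1e-6)
        curve += score * ricker(t, center, width)
    return curve

def curve_to_segments(curve, threshold=0.0):
    """Return (start, end) index pairs, end exclusive, where curve > threshold."""
    above = curve > threshold
    padded = np.concatenate(([False], above, [False]))
    diff = np.diff(padded.astype(int))
    starts = np.flatnonzero(diff == 1)
    ends = np.flatnonzero(diff == -1)
    return list(zip(starts.tolist(), ends.tolist()))

# Hypothetical proposals: two overlapping detections and one isolated detection.
proposals = [(10, 20, 0.9), (11, 21, 0.8), (40, 50, 0.7)]
curve = fuse_proposals(proposals, num_snippets=60)
segments = curve_to_segments(curve)
print(segments)
```

In this sketch the negative side lobes of one wavelet suppress the edges of overlapping proposals, so the two overlapping detections merge into a single segment covering the region where they agree. A real pipeline would also need per-class handling and a principled threshold, which are left out here.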
