Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction
Quan Zhang, Yuxin Qi, Xi Tang, Rui Yuan, Xi Lin, Ke Zhang, Chun Yuan
TL;DR
This work addresses weakly supervised temporal action localization (WTAL) by treating pseudo-labels as noisy signals that can mislead detectors. It introduces NoCo, a two-stage noise-correction framework. The first stage, CALA, refines pseudo-label boundaries using context. The second stage is an online teacher-student framework with Missing Instance Compensation (MIC), Ambiguous Instance Correction (AIC), and High-Quality Pseudo-Label Mining (HPM), which iteratively corrects label noise and reweights supervision. The approach is decoupled from WTAL baselines, enabling plug-and-play integration, and yields state-of-the-art results on THUMOS14 and ActivityNet v1.2 with substantially faster inference, since only the student model runs during testing. The combination of boundary refinement, online noise correction, and adaptive loss weighting generalizes across WTAL models and provides a practical path toward robust, efficient weakly supervised TAL.
Abstract
Pseudo-label learning methods are widely applied in weakly supervised temporal action localization. Existing works directly use a weakly supervised base model to generate instance-level pseudo-labels for training a fully supervised detection head. We argue that noise in the pseudo-labels interferes with the learning of the fully supervised detection head, leading to significant performance degradation. The noisy labels exhibit three issues: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To address these issues, we introduce a two-stage noisy-label learning strategy that harnesses every potentially useful signal in the noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. In addition, we apply a high-quality pseudo-label mining loss in the online-revised teacher-student framework that assigns different weights to the noisy labels so that training is more effective. Our model substantially outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
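The three kinds of pseudo-label noise named in the abstract can be made concrete with a small diagnostic. The sketch below is illustrative only and is not part of the proposed method: it assumes annotated segments are available for analysis (for example on a validation split, which weakly supervised training itself does not use), and the function names, the `(start, end)` segment format, and the 0.5 IoU threshold are all hypothetical choices.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; assumed format


def overlap(a: Segment, b: Segment) -> float:
    """Length of the temporal intersection of two segments (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = overlap(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def diagnose(pseudo: List[Segment], truth: List[Segment],
             iou_thr: float = 0.5) -> Dict[str, List[Segment]]:
    """Sort pseudo-label errors into the three noise types from the abstract."""
    report: Dict[str, List[Segment]] = {
        "inaccurate_boundary": [],  # one action matched, but IoU is low
        "missed_action": [],        # annotated action with no pseudo-label
        "many_to_one": [],          # one pseudo-label spans several actions
    }
    for p in pseudo:
        covered = [g for g in truth if overlap(p, g) > 0]
        if len(covered) > 1:
            report["many_to_one"].append(p)
        elif len(covered) == 1 and temporal_iou(p, covered[0]) < iou_thr:
            report["inaccurate_boundary"].append(p)
    for g in truth:
        if not any(overlap(p, g) > 0 for p in pseudo):
            report["missed_action"].append(g)
    return report


# Toy example with made-up timestamps.
truth = [(2.0, 6.0), (7.0, 7.5), (10.0, 12.0), (12.5, 14.0)]
pseudo = [(0.0, 4.5), (9.8, 14.2)]
report = diagnose(pseudo, truth)
for kind, segments in report.items():
    print(kind, segments)
```

In this toy example, the first pseudo-label overlaps a single action with low IoU, the short action at 7.0 to 7.5 s receives no pseudo-label, and the second pseudo-label merges two adjacent actions.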
