Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Feixiang Zhou, Bryan Williams, Hossein Rahmani

TL;DR

This work tackles noisy pseudo labels in semi-supervised temporal action localization by introducing Adaptive Pseudo-label Learning (APL), which jointly scores pseudo-label quality through Adaptive Label Quality Assessment (ALQA) and refines the selection via an Instance-level Consistency Discriminator (ICD). ALQA combines localization reliability with classification confidence: it predicts $P_{tiou}$ and $P_{tnd}$ and forms a joint score $\hat{P}=\hat{P}_{diou}\odot\hat{P}_{cls}$, using a DIoU-based soft label for classification; pseudo labels are then selected dynamically using thresholds and Soft-NMS. ICD leverages inter-instance consistency to filter ambiguous positives and mine potential positives, using a learned discriminator $\mathcal{D}$ to compute similarity scores between predicted instances and labeled examples. Action-aware Contrastive Pre-training (ACP) learns multi-scale frame-level representations without labels, using coarse- and fine-grained contrasts to improve discrimination between actions and backgrounds and among different actions. On THUMOS14 and ActivityNet v1.3, the method achieves new state-of-the-art results under multiple labeling ratios, supporting the claimed improvements in pseudo-label quality and representation learning. Together, ALQA, ICD, and ACP offer a robust, end-to-end approach to semi-supervised temporal action localization that reduces annotation cost while maintaining high accuracy.
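The joint score described above can be illustrated with a small NumPy sketch. It assumes the shapes given in the Figure 1 caption ($\hat{P}_{cls}\in\mathbb{R}^{K\times T}$, $\hat{P}_{diou}\in\mathbb{R}^{1\times T}$); the function names, the toy numbers, and the single fixed `threshold` are hypothetical and stand in for the paper's dynamic selection, which is not specified here.

```python
import numpy as np

def joint_scores(p_cls, p_diou):
    """Element-wise product of (K, T) class scores and (1, T) localization scores."""
    p_cls = np.asarray(p_cls, dtype=float)
    p_diou = np.asarray(p_diou, dtype=float).reshape(1, -1)
    return p_cls * p_diou  # the (1, T) row broadcasts over the K classes

def select_candidates(p_joint, threshold):
    """Return (class, time) pairs whose joint score exceeds `threshold`, best first.

    `threshold` is a hypothetical fixed value used only for illustration.
    """
    ks, ts = np.nonzero(p_joint > threshold)
    order = np.argsort(-p_joint[ks, ts], kind="stable")
    return [(int(ks[i]), int(ts[i])) for i in order]

# Toy example: K = 2 classes, T = 3 time steps.
p_cls = np.array([[0.9, 0.8, 0.2],
                  [0.1, 0.3, 0.7]])
p_diou = np.array([[0.5, 0.9, 0.8]])

p_joint = joint_scores(p_cls, p_diou)
picked = select_candidates(p_joint, threshold=0.5)
print(picked)
```

In this toy input, the most confident classification (0.9 at the first time step) is not selected because its localization score is low, which is the kind of re-ranking a joint score is meant to produce.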

Abstract

Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings.
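Once candidates are ranked by a joint score, overlapping temporal segments still need to be de-duplicated; the TL;DR names Soft-NMS for this step. The following is a minimal sketch of Gaussian Soft-NMS on 1D segments, not the authors' implementation. The parameter names and the values of `sigma` and `min_score` are assumptions chosen for illustration.

```python
import numpy as np

def tiou(seg, others):
    """Temporal IoU between one [start, end] segment and an (N, 2) array of segments."""
    inter = np.minimum(seg[1], others[:, 1]) - np.maximum(seg[0], others[:, 0])
    inter = np.clip(inter, 0.0, None)
    union = (seg[1] - seg[0]) + (others[:, 1] - others[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms_1d(segments, scores, sigma=0.5, min_score=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping segments instead of removing them.

    Returns a list of (index, decayed_score) in the order segments were picked.
    `sigma` and `min_score` are illustrative defaults, not values from the paper.
    """
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < min_score:
            break
        kept.append((best, float(scores[best])))
        remaining.remove(best)
        if remaining:
            idx = np.array(remaining)
            overlap = tiou(segments[best], segments[idx])
            scores[idx] *= np.exp(-(overlap ** 2) / sigma)  # Gaussian penalty
    return kept

segments = [[0.0, 10.0], [1.0, 11.0], [20.0, 30.0]]
scores = [0.9, 0.8, 0.7]
kept = soft_nms_1d(segments, scores)
print([i for i, _ in kept])
```

Here the second segment overlaps heavily with the first, so its score is decayed and it drops behind the non-overlapping third segment rather than being discarded outright.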

Paper Structure

This paper contains 10 sections, 12 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Overview of the proposed APL framework. We first use both labeled and unlabeled videos for ACP, without any GT labels, to enhance the frame-level representation. We then update the pre-trained model using a small number of labeled videos and generate pseudo labels for the unlabeled videos, where ALQA jointly learns classification confidence $\hat{P}_{cls}\in \mathbb{R}^{K \times T}$ and localization reliability $\hat{P}_{diou}\in \mathbb{R}^{1 \times T}$ before dynamically selecting pseudo labels according to their joint score. Finally, we propose an ICD to refine the pseudo-label selection by removing ambiguous positives and mining potential positives.
  • Figure 2: Illustration of (a) Adaptive Label Quality Assessment and (b) Instance-level Consistency Discrimination. (a) We evaluate localization reliability with two parallel branches (heads) that predict tIoU and tND, respectively, leading to a joint score of classification and localization. (b) We aggregate the temporal features of labeled action instances using max pooling, then use a discriminator $\mathcal{D}$ to learn the similarity probability of instance pairs. During inference, $\mathcal{D}$ provides similarity scores between predicted instances and labeled instances of the same action category.
  • Figure 3: (a) and (b) The effect of different settings of the hyperparameters $\tau_{icd}$ and $\varsigma_{icd}$. (c) Ablation studies on the quality of pseudo labels when using 10% labeled videos. Class Acc: action classification accuracy. Avg tIoU: average tIoU. Pos Acc: accuracy of positive predictions.
  • Figure 4: The effect of our ACP on THUMOS14. (a) t-SNE visualization of action and background features. (b) t-SNE visualization of features for different actions. The legend for different actions is provided in the supplementary material.
  • Figure S1: Illustration of our Action-aware Contrastive Pre-training. We first obtain the temporal representation encoded by the base model (e.g., ActionFormer (Zhang et al., 2022)). In the coarse-grained contrast, two videos within a mini-batch are sampled to form the representation sets $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$, respectively, and we cluster the corresponding input features to generate frame-wise clustering labels with only two categories (0: action, 1: background). We then contrast actions against backgrounds to attract similar representations and repel different ones. In the fine-grained contrast, we further contrast different kinds of actions based on the combined representation set $\{\mathcal{F}_{1},\mathcal{F}_{2}\}$.
  • ...and 3 more figures
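
The instance-level consistency check in Figure 2(b) can be sketched roughly as follows. This is only an illustration under strong assumptions: the learned discriminator $\mathcal{D}$ is replaced by a cosine-similarity stand-in, and the names `drop_below` and `promote_above` are hypothetical thresholds whose relation to $\tau_{icd}$ and $\varsigma_{icd}$ is not stated in this summary. Only the max pooling of instance features and the comparison against labeled instances of the same category come from the caption.

```python
import numpy as np

def pool_instance(features, start, end):
    """Max-pool (T, C) frame features over frames [start, end) into a (C,) vector."""
    return features[start:end].max(axis=0)

def similarity(a, b):
    """Stand-in for the learned discriminator D: cosine similarity mapped to [0, 1]."""
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return 0.5 * (cos + 1.0)

def refine(pred_vecs, is_positive, labeled_vecs, drop_below, promote_above):
    """Re-decide each prediction from its mean similarity to labeled instances.

    Positives scoring under `drop_below` are removed (ambiguous positives);
    negatives scoring over `promote_above` are promoted (potential positives).
    Both thresholds are hypothetical.
    """
    refined = []
    for vec, pos in zip(pred_vecs, is_positive):
        score = float(np.mean([similarity(vec, ref) for ref in labeled_vecs]))
        refined.append(score >= drop_below if pos else score > promote_above)
    return refined

# Toy frame features: T = 3 frames, C = 2 channels.
features = np.array([[0.2, 0.0],
                     [1.0, 0.05],
                     [0.1, 0.9]])
labeled_vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
pred_vecs = [pool_instance(features, 0, 2),  # selected positive, consistent
             np.array([0.0, 1.0]),           # selected positive, inconsistent
             np.array([0.95, 0.0])]          # rejected, but consistent
is_positive = [True, True, False]

refined = refine(pred_vecs, is_positive, labeled_vecs,
                 drop_below=0.8, promote_above=0.95)
print(refined)
```

In the paper the similarity comes from a trained network rather than a fixed metric; the stand-in is used here only so the filtering and mining logic can run without a model.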