POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization

Elahe Vahdani, Yingli Tian

TL;DR

POTLoc addresses point-level temporal action localization by introducing a self-training pipeline that generates pseudo-labels from base-model proposals to guide a pseudo-label oriented multi-scale transformer and temporal feature pyramid. The method uses three enhanced losses and a sampling strategy to learn action dynamics with only point annotations, enabling robust modeling of actions with varying durations. Empirical evaluations on THUMOS'14 and ActivityNet-v1.2 show POTLoc achieving state-of-the-art performance among point- and weakly-supervised methods, with notable improvements on THUMOS'14 and solid gains on ActivityNet-v1.2. This approach reduces annotation costs while delivering accurate, complete action proposals, advancing practical TAL in unconstrained videos.

Abstract

This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of "pseudo-labels" to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.

Paper Structure (14 sections, 10 equations, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: (a) Framework overview. The modules outlined in gray and blue indicate the components of the base and our POTLoc model, respectively. (b) Pseudo-labels are generated from the noisy action proposals predicted by the base model on the training set. The proposals are refined and adjusted based on the point-labels and statistics of the proposals. (c) The pseudo-labels are sampled within a radius around the annotated points at each level $l$ of the pyramid and the block before the pyramid ($l=0$). This sampling helps to mitigate the addition of excessive noise during training, which could be caused by imprecise estimated action boundaries. (a,d) The multi-scale temporal transformer learns to model temporal dependencies and accommodate actions of varying duration when optimized with our enhanced losses, $\mathcal{L}^{\ast}_{\text{MIL}}$, $\mathcal{L}^{\ast}_{\text{Act}}$, and $\mathcal{L}^{\ast}_{\text{BG}}$ supervised with the pseudo-labels.
  • Figure 2: False negative (FN) profiling of ActionFormer (Zhang et al., 2022; fully-supervised), POTLoc (point-supervised), and the base model (point-supervised) on THUMOS'14 using DETAD (Alwassel et al., 2018).
  • Figure 3: False positive (FP) profiling of ActionFormer (Zhang et al., 2022; fully-supervised), POTLoc (point-supervised), and the base model (point-supervised) on THUMOS'14 using DETAD (Alwassel et al., 2018).
  • Figure 4: Qualitative results on THUMOS'14. The ground-truth instances are highlighted in green. The detection results are displayed from: (1) the base model supervised with point-level annotations (blue), and (2) our POTLoc framework (orange). Transparent frames represent background frames.
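
The sampling step described in Figure 1(c) can be illustrated with a minimal sketch. The caption states only that pseudo-labels are sampled within a radius around the annotated points at each pyramid level $l$, with $l=0$ denoting the block before the pyramid. Everything else below is an assumption made for illustration and is not taken from the paper: the function name `sample_pseudo_labels`, the stride of $2^l$ per level, a radius measured in each level's own grid units, and pseudo-labels represented as inclusive `(start, end)` snippet intervals.

```python
import numpy as np

def sample_pseudo_labels(num_snippets, points, intervals, radius, num_levels):
    """Return one boolean mask per level (0..num_levels).

    A position is kept as a pseudo-labeled action position only if it lies
    inside the pseudo-label interval AND within `radius` level-l steps of
    the annotated point. All names and conventions here are hypothetical.
    """
    masks = []
    for level in range(num_levels + 1):       # level 0: block before the pyramid
        stride = 2 ** level                   # assumed downsampling factor
        length = int(np.ceil(num_snippets / stride))
        centers = np.arange(length) * stride  # level-l positions in snippet units
        mask = np.zeros(length, dtype=bool)
        for point, (start, end) in zip(points, intervals):
            inside = (centers >= start) & (centers <= end)
            near = np.abs(centers - point) <= radius * stride
            mask |= inside & near             # drop far-away, less reliable positions
        masks.append(mask)
    return masks

# One annotated point at snippet 8 with a pseudo-label spanning snippets 4..12.
masks = sample_pseudo_labels(16, points=[8], intervals=[(4, 12)], radius=1, num_levels=2)
for level, mask in enumerate(masks):
    print(level, mask.nonzero()[0].tolist())
```

Restricting supervision to positions near the annotated point reflects the motivation given in the caption: the estimated boundaries are imprecise, so positions far from the point are more likely to carry label noise.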