Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, Qingming Huang

TL;DR

This paper proposes a multi-task learning framework that fully utilizes point supervision to strengthen a model's temporal understanding for action localization. It adds three self-supervised temporal understanding tasks that help the model capture the temporal consistency of actions across videos.

Abstract

Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to locate action instances within untrimmed videos. Most existing approaches design the task head with only point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding these temporal relationships is crucial because it helps a model understand how an action is defined, which in turn helps it localize all frames of the action. To this end, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.

Paper Structure

This paper contains 18 sections, 18 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Previous methods primarily use point annotations to supervise the action classification of labeled snippets (Task 1). We improve upon this by incorporating three point-based self-supervised tasks (Tasks 2, 3, and 4) that enable the model to better capture temporal consistency and understand action sequences, thereby boosting action localization.
  • Figure 2: The workflow of the proposed method. The RGB and optical flow snippets of the input video are fed into the pretrained feature extractor to generate features $\mathbf{F}$, which are further embedded as $\mathbf{X}$. Besides the conventional snippet classification task, the embedded features $\mathbf{X}$ are also used to jointly train the three newly designed temporal consistency tasks: Action Completion (AC), Action Order Understanding (AOU), and Action Regularity Understanding (ARU).
  • Figure 3: Qualitative comparison between our proposed method, HR-Pro, and LACP on ActivityNet 1.3. We provide four cases of temporal action localization from different action classes, including "Fixing bicycle", "Powerbocking", "Shoveling snow", and "Hurling". Prediction errors are highlighted with red dashed boxes. The IoUs between our detection results and the ground truths are notably higher.
  • Figure 4: t-SNE visualization of the distribution of (a) raw features and (b) embedded features, respectively, where dots represent snippet features and different colors indicate different action categories.
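
The captions of Figures 1 and 2 describe a snippet classification task supervised only at the annotated points, trained jointly with three auxiliary tasks. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the `point_labels` format, the weighted-sum objective, and the numeric values used for the AC, AOU, and ARU losses are all assumptions made for illustration, since the summary above does not define those losses.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of `label`, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def point_classification_loss(snippet_logits, point_labels):
    """Average cross-entropy over the annotated snippets only.

    snippet_logits: list of T per-snippet class-score lists.
    point_labels:   dict {snippet index: class index}, one entry per
                    annotated action instance (hypothetical format).
    """
    losses = [cross_entropy(snippet_logits[t], c) for t, c in point_labels.items()]
    return sum(losses) / len(losses)

def joint_loss(task_losses, weights):
    """Weighted sum of per-task losses (the weighting scheme is an assumption)."""
    return sum(weights[name] * value for name, value in task_losses.items())

# Toy example: 5 snippets, 3 classes, point annotations at snippets 1 and 3.
logits = [[0.0, 0.0, 0.0],
          [2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 3.0],
          [0.0, 0.0, 0.0]]
points = {1: 0, 3: 2}
cls = point_classification_loss(logits, points)

# Placeholder numbers stand in for the AC, AOU, and ARU losses, which are
# not defined in the summary above.
losses = {"cls": cls, "ac": 0.5, "aou": 0.25, "aru": 0.125}
weights = {"cls": 1.0, "ac": 1.0, "aou": 1.0, "aru": 1.0}
total = joint_loss(losses, weights)
print(f"classification loss = {cls:.4f}, joint loss = {total:.4f}")
```

In this sketch, snippets without a point annotation contribute nothing to the classification term, which illustrates why classification alone leaves most of the video unsupervised and why additional training signals are of interest.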