STAT: Towards Generalizable Temporal Action Localization

Yangcen Liu, Ziyi Liu, Yuanhao Zhai, Wen Li, David Doermann, Junsong Yuan

TL;DR

This work addresses the generalization gap in weakly-supervised temporal action localization (WTAL) by introducing the Generalizable Temporal Action Localization (GTAL) task. An analysis of cross-distribution (CrD) performance identifies localization errors and action-scale variation as the primary challenges. The proposed method, STAT, is a self-supervised teacher-student framework with a temporal refinement module that adapts attention to the target scale and an alignment module that harmonizes teacher and student outputs. With two-stage training and cross-dataset evaluation on THUMOS14, ActivityNet1.2, and HACS, STAT achieves significant CrD improvements that approach same-distribution (SmD) performance. The approach offers a practical path toward real-world TAL deployment under distribution shift; remaining challenges include maintaining uniform performance across SmD and CrD and leveraging class-aware annotations.

Abstract

Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels. Despite significant progress, existing methods suffer severe performance degradation when transferred to different distributions and thus adapt poorly to real-world scenarios. To address this problem, we propose the Generalizable Temporal Action Localization (GTAL) task, which focuses on improving the generalizability of action localization methods. We observe that the performance decline can be attributed primarily to a lack of generalizability to different action scales. To this end, we propose STAT (Self-supervised Temporal Adaptive Teacher), which leverages a teacher-student structure for iterative refinement. STAT features a refinement module and an alignment module. The former iteratively refines the model's output by leveraging contextual information and helps it adapt to the target scale. The latter improves the refinement process by promoting consensus between the student and teacher models. We conduct extensive experiments on three datasets, THUMOS14, ActivityNet1.2, and HACS. The results show that our method significantly improves over baseline methods under the cross-distribution evaluation setting, even approaching same-distribution evaluation performance.

Paper Structure (14 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Performance comparison between same-distribution (SmD) and cross-distribution (CrD) evaluation. Left: Under the CrD setting, state-of-the-art methods demonstrate significant performance degradation. Right: Compared to the SmD results, the CrD results appear to be more fragmented, highlighting a lack of adaptability in these methods to the temporal scale variations of the target distribution.
  • Figure 2: Snippet classification results and dataset statistics.
  • Figure 3: DETAD diagnostic analysis of DELU predictions. Left: SmD results. Right: CrD results. Compared with SmD, the CrD predictions contain many more localization errors.
  • Figure 4: Overall framework of our Self-supervised Temporal Adaptive Teacher. The pipeline is built on the mean teacher framework. First, the teacher and student models are both initialized from the SmD model. Then, for each input video, the teacher and the student separately predict the class activation sequence (CAS) and attention. Next, the attention predicted by the teacher is refined in the temporal refinement module. Finally, the outputs of the student and teacher models are aligned in the alignment module, guiding the student to adapt to the target scale.
  • Figure 5: Ablation study on the refinement parameter $\alpha$. Left: THUMOS14 to ActivityNet1.2. Right: ActivityNet1.2 to THUMOS14.
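
The Figure 4 caption says the pipeline is built on the mean teacher framework, with both models initialized from the SmD model. The sketch below illustrates only the generic mean-teacher mechanics under stated assumptions; it is not the authors' implementation. In a typical mean-teacher setup the teacher's weights track an exponential moving average (EMA) of the student's weights, which is what `ema_update` shows. The function names, the `momentum` value, the toy parameter dictionary, and the squared-error `consistency_loss` (a stand-in for the alignment module, whose actual loss is not given here) are all hypothetical. The temporal refinement module is not modeled.

```python
import copy
from typing import Dict, List

Params = Dict[str, List[float]]


def init_teacher_student(smd_params: Params) -> tuple:
    """Both models start as independent copies of the same-distribution (SmD) model."""
    return copy.deepcopy(smd_params), copy.deepcopy(smd_params)


def ema_update(teacher: Params, student: Params, momentum: float = 0.999) -> None:
    """In-place EMA step: teacher <- momentum * teacher + (1 - momentum) * student.

    The momentum value is an assumption, not taken from the paper.
    """
    for name, student_vals in student.items():
        teacher_vals = teacher[name]
        for i, s in enumerate(student_vals):
            teacher_vals[i] = momentum * teacher_vals[i] + (1.0 - momentum) * s


def consistency_loss(student_attn: List[float], teacher_attn: List[float]) -> float:
    """Mean squared error between per-snippet attention scores.

    Hypothetical stand-in for the alignment module's objective.
    """
    if len(student_attn) != len(teacher_attn):
        raise ValueError("attention sequences must have the same length")
    total = sum((s - t) ** 2 for s, t in zip(student_attn, teacher_attn))
    return total / len(student_attn)


if __name__ == "__main__":
    # Toy "model" with a single two-element parameter vector.
    teacher, student = init_teacher_student({"w": [1.0, 0.0]})
    student["w"] = [0.0, 1.0]  # pretend a gradient step changed the student
    ema_update(teacher, student, momentum=0.9)
    print(teacher["w"])
    print(consistency_loss([0.0, 1.0], [0.0, 0.0]))
```

In such a setup, only the student would be trained by gradient descent on the target-distribution videos, while the teacher changes slowly through the EMA step and therefore provides more stable targets.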