Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment

Angchi Xu, Wei-Shi Zheng

TL;DR

The paper addresses weakly-supervised action segmentation (WSAS) by avoiding costly frame-by-frame alignment and instead detecting a small set of action transitions. It introduces ATBA, which combines class-agnostic boundary cues with transition-specific patterns in a dynamic-programming (DP) alignment that selects the most plausible boundaries and uses them to generate pseudo labels for training. Complementary video-level losses strengthen semantic learning under pseudo-label noise, and a pyramid temporal network enables efficient long-video processing. Empirical results show state-of-the-art or competitive performance with significantly faster training and inference, highlighting ATBA's practical impact for WSAS in instructional and cinematic videos.

Abstract

Weakly-supervised action segmentation is the task of learning to partition a long video into several action segments, where training videos are accompanied only by transcripts (ordered lists of actions). Most existing methods need to infer a pseudo segmentation for training through serial alignment between all frames and the transcript, which is time-consuming and hard to parallelize during training. In this work, we aim to escape from this inefficient alignment over massive but redundant frames and instead directly localize a few action transitions for pseudo segmentation generation, where a transition refers to the change from an action segment to its next adjacent one in the transcript. As the true transitions are submerged in noisy boundaries caused by intra-segment visual variation, we propose a novel Action-Transition-Aware Boundary Alignment (ATBA) framework to efficiently and effectively filter out noisy boundaries and detect transitions. In addition, to boost semantic learning when noise is inevitably present in the pseudo segmentation, we also introduce video-level losses that exploit the trusted video-level supervision. Extensive experiments show the effectiveness of our approach in terms of both performance and training speed.
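
To make the idea of generating a pseudo segmentation from a few transitions concrete, here is a minimal sketch in Python. It assumes that each transition is given as the index of the first frame of the next segment; that convention, the function name `pseudo_segmentation`, and the toy action names are illustrative assumptions, not taken from the paper.

```python
def pseudo_segmentation(transcript, transitions, num_frames):
    """Expand a transcript into frame-wise pseudo labels.

    Assumed convention (hypothetical, for illustration only):
    transitions[k] is the first frame of segment k + 1, so a transcript
    with N actions needs exactly N - 1 transitions.
    """
    if len(transitions) != len(transcript) - 1:
        raise ValueError("need exactly len(transcript) - 1 transitions")

    # Segment k covers frames bounds[k] .. bounds[k + 1] - 1.
    bounds = [0, *transitions, num_frames]
    if any(start >= end for start, end in zip(bounds, bounds[1:])):
        raise ValueError("transitions must be strictly increasing and inside the video")

    labels = []
    for action, start, end in zip(transcript, bounds, bounds[1:]):
        labels.extend([action] * (end - start))
    return labels


# Toy example: a 7-frame video with three actions and two transitions.
frame_labels = pseudo_segmentation(["take_cup", "pour_coffee", "stir"], [2, 5], 7)
```

Under this convention, only the N - 1 transition positions have to be estimated; the per-frame labels then follow directly from the transcript order.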

Paper Structure

This paper contains 23 sections, 10 equations, 9 figures, 4 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Comparison of performance and training time of WSAS methods on the Breakfast dataset. MoF: the main metric of the task (higher is better). *: alignment-free methods. Our ATBA achieves the best performance with a very short training time.
  • Figure 2: The necessity of the proposed ATBA. The example is P54-webcam01-P54-coffee in the Breakfast dataset. GT: the ground-truth segmentation. C.A.Bdy.: only class-agnostic boundary detection is applied (Exp. 1 of Table \ref{tab:atba}). Acc.: the accuracy of the pseudo segmentation. In the video clip around the "star" point, the coffee pot changes from being picked up to being tilted for pouring within the segment "Pour Coffee", and this noisy visual change is incorrectly detected as a boundary. In addition, although two boundaries are correctly detected by "C.A.Bdy." (diamonds), they correspond to incorrect transitions because of one false positive (star), resulting in complete dislocation of the segments within the dashed box. Best viewed in color.
  • Figure 3: The overall framework. We propose an Action-Transition-Aware Boundary Alignment (ATBA) framework, which takes the class-agnostic boundary pattern and action transition pattern together into account to efficiently generate pseudo labels. The trusted video-level supervision is also utilized to further enhance the performance.
  • Figure 4: (a) A 7$\times$7 template for class-agnostic boundary scoring. (b) A 2$\times$7 template for action transition scoring.
  • Figure 5: Illustration of the alignment between action transitions $\mathcal{R}$ and candidate boundaries $\widetilde{\mathcal{B}}$. Blue circles are aligned and gray ones are dropped. (a) A valid alignment. (b) An invalid alignment. The red dashed arrow violates the ordering consistency. Best viewed in color.
  • ...and 4 more figures
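
The ordering constraint illustrated in Figure 5 (each transition is matched to one candidate boundary, matches must keep temporal order, and unmatched candidates are dropped) can be sketched as a small dynamic program. This is a generic order-preserving alignment written for illustration, not the paper's exact algorithm: the score matrix `scores`, where `scores[k][m]` rates how well candidate boundary `m` fits transition `k`, and the function name `align_transitions` are assumptions, and how ATBA actually computes its scores is not reproduced here.

```python
def align_transitions(scores):
    """Order-preserving alignment of transitions to candidate boundaries.

    scores[k][m] is a hypothetical score for matching transition k to
    candidate boundary m. Returns (picked, total), where picked[k] is the
    candidate index assigned to transition k and picked is strictly
    increasing. Candidates that are not picked are dropped.
    """
    num_t = len(scores)
    num_c = len(scores[0]) if scores else 0
    if num_t > num_c:
        raise ValueError("need at least as many candidates as transitions")

    neg_inf = float("-inf")
    # dp[k][m]: best total score for the first k transitions using only
    # the first m candidates. take[k][m] records whether candidate m - 1
    # was matched to transition k - 1 in that best solution.
    dp = [[neg_inf] * (num_c + 1) for _ in range(num_t + 1)]
    take = [[False] * (num_c + 1) for _ in range(num_t + 1)]
    for m in range(num_c + 1):
        dp[0][m] = 0.0

    for k in range(1, num_t + 1):
        for m in range(k, num_c + 1):
            drop = dp[k][m - 1]                             # drop candidate m - 1
            match = dp[k - 1][m - 1] + scores[k - 1][m - 1]  # match it to transition k - 1
            if match >= drop:
                dp[k][m] = match
                take[k][m] = True
            else:
                dp[k][m] = drop

    # Backtrack from the full problem to recover the chosen candidates.
    picked = []
    k, m = num_t, num_c
    while k > 0:
        if take[k][m]:
            picked.append(m - 1)
            k -= 1
        m -= 1
    picked.reverse()
    return picked, dp[num_t][num_c]


# Toy example: 2 transitions, 4 candidate boundaries.
toy_scores = [
    [0.9, 0.1, 0.2, 0.0],
    [0.8, 0.3, 0.1, 0.7],
]
picked, total = align_transitions(toy_scores)
```

In the toy example, the second transition scores highest on candidate 0, but that candidate is already the best match for the first transition, so the ordering constraint pushes the second transition to candidate 3. The table has one cell per (transition, candidate) pair, so its cost depends on the number of candidate boundaries rather than on the number of frames.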