Efficient Temporal Action Segmentation via Boundary-aware Query Voting

Peiyao Wang, Yuewei Lin, Erik Blasch, Jie Wei, Haibin Ling

TL;DR

BaFormer reframes temporal action segmentation as sparse, per-segment classification using a boundary-aware Transformer that decouples segment masks from class predictions via instance queries and a global boundary query. A boundary-aware voting mechanism then consolidates per-query predictions into coherent, continuous segments, enabling a single-stage TAS pipeline with substantially reduced computation. Empirical results across GTEA, 50Salads, and Breakfast show competitive accuracy with dramatically lower FLOPs and runtime compared with state-of-the-art multi-stage methods and even single-stage baselines. The approach bridges MaskFormer-style segmentation concepts with TAS, delivering a practical, scalable solution for real-time or resource-constrained settings.

Abstract

Although the performance of Temporal Action Segmentation (TAS) has improved in recent years, achieving promising results often comes with a high computational cost due to dense inputs, complex model structures, and resource-intensive post-processing requirements. To improve efficiency while maintaining performance, we present a novel perspective centered on per-segment classification. By harnessing the capabilities of Transformers, we tokenize each video segment as an instance token, endowed with intrinsic instance segmentation. To realize efficient action segmentation, we introduce BaFormer, a boundary-aware Transformer network. It employs instance queries for instance segmentation and a global query for class-agnostic boundary prediction, yielding continuous segment proposals. During inference, BaFormer employs a simple yet effective voting strategy to classify boundary-wise segments based on instance segmentation. Remarkably, as a single-stage approach, BaFormer significantly reduces computational costs, using only 6% of the running time of the state-of-the-art method DiffAct, while producing better or comparable accuracy on several popular benchmarks. The code for this project is publicly available at https://github.com/peiyao-w/BaFormer.
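
The abstract describes the voting step only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each instance query yields a per-frame mask probability and a class distribution, and that the boundary query yields a list of frame indices where segments start. Within each boundary-delimited segment, every query votes for its classes with a weight equal to its summed mask probability in that segment. The function name `boundary_aware_vote`, its arguments, and the weighting rule are all hypothetical.

```python
def boundary_aware_vote(mask_probs, class_probs, boundaries):
    """Label each boundary-delimited segment by a mask-weighted vote over queries.

    All inputs are illustrative assumptions, not the paper's actual interface:
    mask_probs[q][t]  : per-frame mask probability of query q
    class_probs[q][c] : class probability of query q
    boundaries        : frame indices at which a new segment starts
    """
    num_frames = len(mask_probs[0])
    num_classes = len(class_probs[0])
    # Segment edges always include the first frame and the end of the video.
    inner = [b for b in boundaries if 0 < b < num_frames]
    edges = sorted({0, num_frames, *inner})
    labels = [0] * num_frames
    for start, end in zip(edges[:-1], edges[1:]):
        scores = [0.0] * num_classes
        for q_mask, q_cls in zip(mask_probs, class_probs):
            # A query's vote weight is how much of its mask falls in this segment.
            weight = sum(q_mask[start:end])
            for c in range(num_classes):
                scores[c] += weight * q_cls[c]
        winner = max(range(num_classes), key=scores.__getitem__)
        # Every frame in the segment receives the winning class.
        labels[start:end] = [winner] * (end - start)
    return labels


# Toy input: 2 queries, 6 frames, 2 classes, one predicted boundary at frame 3.
mask_probs = [
    [0.9, 0.8, 0.6, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.4, 0.8, 0.9, 0.9],
]
class_probs = [
    [0.9, 0.1],
    [0.2, 0.8],
]
labels = boundary_aware_vote(mask_probs, class_probs, boundaries=[3])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Because a label is chosen once per segment rather than once per frame, the output is continuous within each segment by construction, which is the property the abstract attributes to boundary-wise classification.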

Paper Structure

This paper contains 24 sections, 9 equations, 10 figures, 11 tables, 2 algorithms.

Figures (10)

  • Figure 1: Accuracy vs. inference time on 50Salads. The bubble size represents the FLOPs in inference. Under different backbones, BaFormer enjoys the benefit of boundary-aware query voting with less running time and improved accuracy.
  • Figure 1: Comparative analysis of matching strategies on 50Salads. ($\#$Q: number of queries.)
  • Figure 2: An efficient pipeline developed by our proposed BaFormer.
  • Figure 3: Overview of BaFormer architecture. It predicts query classes and masks, along with boundaries from output heads. Although each layer in the Transformer decoder holds three heads, we illustrate the three heads in the last layer for simplicity.
  • Figure 3: Performance and efficiency of different voting strategies on 50Salads.
  • ...and 5 more figures