Faster Diffusion Action Segmentation

Shuaibing Wang, Shunli Wang, Mingcheng Li, Dingkang Yang, Haopeng Kuang, Ziyun Qian, Lihua Zhang

TL;DR

This paper addresses Temporal Action Segmentation (TAS) with diffusion-based methods, which are accurate but computationally expensive because they require many sampling steps. It introduces EffiDiffAct, which combines a lightweight Temporal Dilation Perception (TDP) encoder with an adaptive skip strategy to accelerate inference while preserving segmentation quality. Experiments on 50Salads, Breakfast, and GTEA show strong performance gains on large datasets and notable efficiency improvements, and ablations highlight the contributions of the TDP encoder and adaptive sampling. The approach offers a practical path toward real-time TAS with diffusion models by reducing computational cost without sacrificing accuracy.

Abstract

Temporal Action Segmentation (TAS) is an essential task in video analysis that aims to segment continuous video frames into distinct action segments and classify each one. The ambiguous boundaries between actions make high-precision segmentation challenging. Diffusion models have recently achieved substantial success in TAS thanks to their stable training and high-quality generation. However, the many sampling steps they require impose a substantial computational burden, which limits their practicality in real-time applications. In addition, most related works use Transformer-based encoders. Although these architectures excel at capturing long-range dependencies, they incur high computational costs and suffer from feature smoothing when processing long video sequences. To address these challenges, we propose EffiDiffAct, an efficient and high-performance TAS algorithm. Specifically, we develop a lightweight temporal feature encoder that reduces computational overhead and mitigates the rank collapse associated with traditional self-attention. Furthermore, we introduce an adaptive skip strategy that dynamically adjusts the skip length between timesteps during inference based on computed similarity metrics, further improving computational efficiency. Comprehensive experiments on the 50Salads, Breakfast, and GTEA datasets demonstrate the effectiveness of the proposed algorithm.
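
The adaptive skip idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the similarity metric or the rule for changing the skip length, so the frame-wise label agreement, the doubling rule, the threshold, and every name below (`agreement`, `adaptive_schedule`, `predict`) are assumptions chosen for illustration only. The `predict` callable is a hypothetical stand-in for one denoising call of the model.

```python
from typing import Callable, List, Sequence


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of frames whose predicted label is identical in a and b."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def adaptive_schedule(
    predict: Callable[[int], Sequence[int]],
    total_steps: int,
    min_skip: int = 1,
    max_skip: int = 8,
    threshold: float = 0.95,
) -> List[int]:
    """Return the timesteps visited, from total_steps - 1 down to 0.

    predict(t) is a hypothetical stand-in for one denoising call that
    returns frame-wise action labels at timestep t.
    """
    t = total_steps - 1
    skip = min_skip
    visited = [t]
    prev = predict(t)
    while t > 0:
        t = max(t - skip, 0)  # never step past the final timestep 0
        cur = predict(t)
        visited.append(t)
        # Assumed rule: when consecutive predictions barely change,
        # double the skip (capped); otherwise fall back to the smallest skip.
        if agreement(prev, cur) >= threshold:
            skip = min(skip * 2, max_skip)
        else:
            skip = min_skip
        prev = cur
    return visited


def toy_predict(t: int) -> List[int]:
    """Toy model: labels fluctuate for t >= 10 and are stable below 10."""
    return [t % 3] * 4 if t >= 10 else [1] * 4


schedule = adaptive_schedule(toy_predict, total_steps=20)
print(schedule)
```

With the toy predictor, the schedule visits every timestep while the labels are still changing and takes progressively larger skips once they stabilize, so fewer than 20 model calls are made. A fixed skip strategy would instead use the same skip length throughout, regardless of how much the prediction changes between steps.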

Paper Structure

This paper contains 14 sections, 11 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: The application process of diffusion models in TAS tasks.
  • Figure 2: The overall framework of the EffiDiffAct algorithm. Given video features as conditional information, the model learns the mapping between noise sequences and action labels during the training phase, and during the inference phase, it restores action labels $\hat{Y}_{0}$ from noise sequences $\hat{Y}_{S}$.
  • Figure 3: The overall framework of the TDP encoder, which maximizes video information through three levels: boundary level, global level, and dilation level.
  • Figure 4: Comparison diagram of fixed skip strategy and adaptive skip strategy.
  • Figure 5: Example of the iterative denoising process for the adaptive skip strategy, where different colors in the diagram represent different action categories.
  • ...and 1 more figures