ASFormer: Transformer for Action Segmentation

Fangqiu Yi, Hongyu Wen, Tingting Jiang

TL;DR

ASFormer tackles frame-level action segmentation in long videos by integrating a dilated temporal-convolutional encoder with a hierarchical attention pattern and a cross-attentive, multi-decoder design for iterative refinement. By introducing local connectivity bias, a predefined hierarchical representation, and refinement-capable decoding, it achieves state-of-the-art results on 50Salads, GTEA, and Breakfast while remaining scalable to long sequences. The work demonstrates that carefully engineered inductive biases and decoder design enable Transformer-based models to excel in temporally structured video tasks and provides a strong backbone for future action-segmentation research.

Abstract

Algorithms for the action segmentation task typically use temporal models to predict what action is occurring at each frame for a minute-long daily activity. Recent studies have shown the potential of the Transformer in modeling the relations among elements in sequential data. However, there are several major concerns when directly applying the Transformer to the action segmentation task, such as the lack of inductive biases with small training sets, the difficulty of processing long input sequences, and the limited ability of the decoder architecture to use temporal relations among multiple action segments to refine the initial predictions. To address these concerns, we design an efficient Transformer-based model for the action segmentation task, named ASFormer, with three distinctive characteristics: (i) We explicitly bring in local connectivity inductive priors because of the high locality of features. This constrains the hypothesis space within a reliable scope and helps the model learn a proper target function with small training sets. (ii) We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences. (iii) We carefully design the decoder to refine the initial predictions from the encoder. Extensive experiments on three public datasets demonstrate the effectiveness of our method. Code is available at https://github.com/ChinaYi/ASFormer.

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (b) Overview of the ASFormer model, which consists of an encoder and several decoders that perform iterative refinement. The encoder receives the video sequence and outputs initial predictions; it consists of a series of encoder blocks with a pre-defined hierarchical representation pattern. Each decoder receives predictions as input and has an architecture similar to the encoder's. (a) Each encoder block consists of a feed-forward layer (a dilated temporal convolution) and a self-attention layer with residual connections. (c) The decoder block uses a cross-attention mechanism to bring in information from the encoder.
  • Figure 2: Visualization of the attention weights for an anchor frame (red +) in each encoder block; more visualizations can be found in the supplementary material. (a) Without the hierarchical pattern (window size set to 512 in all blocks). (b) With the hierarchical pattern.
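
The encoder block described in the Figure 1 caption (a dilated temporal convolution followed by self-attention with a residual connection) and the per-block attention window mentioned in Figure 2 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the doubling window and dilation schedule, the fixed averaging kernel, and the use of the raw features as queries, keys, and values are all assumptions made for illustration, and a real model would use learned weights.

```python
import math


def window_sizes(num_blocks, base=1):
    # Hypothetical schedule (assumption): the local window doubles per block.
    return [base * 2 ** i for i in range(num_blocks)]


def softmax(scores):
    # Numerically stable softmax over a non-empty list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_self_attention(x, window):
    # x: list of T feature vectors. Frame t attends only to frames within
    # window // 2 steps on either side (the local connectivity bias).
    # Assumption: no learned projections; features serve as Q, K, and V.
    T, d = len(x), len(x[0])
    half = window // 2
    out = []
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        scores = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(lo, hi)]
        weights = softmax(scores)
        out.append([sum(w * x[s][k] for w, s in zip(weights, range(lo, hi)))
                    for k in range(d)])
    return out


def dilated_conv(x, dilation):
    # Kernel-size-3 temporal convolution with zero padding. The fixed
    # averaging kernel is a stand-in (assumption) for learned weights.
    T, d = len(x), len(x[0])
    zero = [0.0] * d
    out = []
    for t in range(T):
        left = x[t - dilation] if t - dilation >= 0 else zero
        right = x[t + dilation] if t + dilation < T else zero
        out.append([(l + c + r) / 3.0 for l, c, r in zip(left, x[t], right)])
    return out


def encoder_block(x, window, dilation):
    # Feed-forward (dilated convolution), then local self-attention,
    # combined through a residual connection.
    ff = dilated_conv(x, dilation)
    att = local_self_attention(ff, window)
    return [[a + b for a, b in zip(ra, rf)] for ra, rf in zip(att, ff)]


# Stack three blocks over a toy sequence of 8 frames with 2 features each.
frames = [[float(t), 1.0] for t in range(8)]
h = frames
for i, w in enumerate(window_sizes(3)):
    h = encoder_block(h, window=w, dilation=2 ** i)
print(len(h), len(h[0]))
```

Because each frame only scores the frames inside its window, the cost of one block grows with the sequence length times the window size rather than with the square of the sequence length, which is the motivation for restricting attention on long videos.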