Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer

Wenhan Wu, Ce Zheng, Zihao Yang, Chen Chen, Srijan Das, Aidong Lu

TL;DR

This work presents FreqMixFormer, a frequency-aware mixed transformer for skeleton action recognition that addresses subtle discriminative motions by encoding joint trajectories in the frequency domain via Discrete Cosine Transform (DCT) and fusing these frequency features with spatial joint relations through a frequency-aware attention mechanism. The architecture comprises frequency-aware attention blocks (FAB), spatial attention blocks (SAB), a frequency operator to emphasize high-frequency components, and a temporal transformer to capture global inter-frame correlations, culminating in state-of-the-art results on NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Extensive ablations demonstrate the contributions of FAB, the frequency operator, and the temporal module, as well as robust performance on confusing actions. The work highlights the importance of incorporating frequency-domain cues into transformer-based skeleton action recognition, offering practical improvements for precise action understanding in real-world scenarios.
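To make the frequency encoding of joint trajectories concrete, the sketch below applies an orthonormal DCT-II to one joint coordinate over time and compares the energy in the low and high halves of the spectrum. It is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `dct_ii` and `band_energy`, the toy trajectories, and the half-spectrum cutoff are all hypothetical choices made for this example.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D trajectory (one joint coordinate over T frames)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            v * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for t, v in enumerate(x)))
    return coeffs

def band_energy(coeffs, cutoff):
    """Return (low, high): summed squared coefficients below / at-or-above `cutoff`."""
    low = sum(c * c for c in coeffs[:cutoff])
    high = sum(c * c for c in coeffs[cutoff:])
    return low, high

# Toy data (assumption): a slow drift, and the same drift with a small tremor.
T = 16
smooth = [0.1 * t for t in range(T)]
jittery = [0.1 * t + 0.05 * (-1) ** t for t in range(T)]

low_s, high_s = band_energy(dct_ii(smooth), cutoff=T // 2)
low_j, high_j = band_energy(dct_ii(jittery), cutoff=T // 2)
print(f"smooth : low={low_s:.4f} high={high_s:.4f}")
print(f"jittery: low={low_j:.4f} high={high_j:.4f}")
```

In this toy case the tremor raises the high-band energy while the slow drift dominates the low band, which is the kind of cue a frequency-domain representation can expose for subtle motions.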

Abstract

Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and have thereby gained ever-increasing attention in skeleton action recognition. However, existing transformer-based approaches rely heavily on the naive attention mechanism to capture spatiotemporal features, which falls short in learning discriminative representations for actions that exhibit similar motion patterns. To address this challenge, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer), specifically designed for recognizing similar skeletal actions with subtle discriminative motions. First, we introduce a frequency-aware attention module to unweave skeleton frequency representations by embedding joint features into frequency attention maps, aiming to distinguish the discriminative movements based on their frequency coefficients. Subsequently, we develop a mixed transformer architecture that combines spatial features with frequency features to model comprehensive frequency-spatial patterns. Additionally, a temporal transformer is proposed to extract the global correlations across frames. Extensive experiments show that FreqMixFormer outperforms SOTA on three popular skeleton action recognition datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA.
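The idea of combining a frequency attention map with a spatial one can be sketched as follows. This is a simplified illustration, not the paper's architecture: it uses identity projections instead of learned query/key weights, a hypothetical `frequency_operator` that scales the upper DCT coefficients by a fixed `gain`, and a plain average as the fusion rule. All of these, along with the toy trajectories, are assumptions.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n) *
            sum(v * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                for t, v in enumerate(x))
            for k in range(n)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention_map(feats):
    """J x J row-softmax of scaled dot products between joint feature vectors."""
    d = len(feats[0])
    return [softmax([sum(a * b for a, b in zip(fi, fj)) / math.sqrt(d)
                     for fj in feats]) for fi in feats]

def frequency_operator(coeffs, cutoff, gain):
    """Hypothetical operator: scale coefficients with index >= cutoff by `gain`."""
    return [c * gain if k >= cutoff else c for k, c in enumerate(coeffs)]

def mixed_map(trajs, cutoff, gain):
    """Return (spatial, frequency, mixed) J x J maps; mixing by average is an assumption."""
    spatial = attention_map(trajs)
    freq_feats = [frequency_operator(dct_ii(x), cutoff, gain) for x in trajs]
    freq = attention_map(freq_feats)
    mixed = [[0.5 * (s + f) for s, f in zip(rs, rf)]
             for rs, rf in zip(spatial, freq)]
    return spatial, freq, mixed

# Toy input (assumption): three joints, each a length-8 trajectory.
T = 8
trajs = [
    [0.1 * t for t in range(T)],                    # slow drift
    [0.1 * t + 0.3 * (-1) ** t for t in range(T)],  # drift plus fast tremor
    [0.2] * T,                                      # static joint
]
spatial, freq, mixed = mixed_map(trajs, cutoff=4, gain=3.0)
for row in mixed:
    print([round(v, 3) for v in row])
```

One design point worth noting: an orthonormal DCT preserves dot products, so with identity projections and `gain=1.0` the frequency map equals the spatial map. In this sketch the two maps differ only because the high-frequency coefficients are reweighted; in a trained model, learned projections in the frequency domain would also make them differ.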

Paper Structure (31 sections, 18 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: The overall design of our Frequency-aware Mixed Transformer. Our FreqMixFormer model overcomes the limitations of traditional transformer-based methods, which struggle to recognize confusing actions such as reading and writing because they process skeleton sequences in a straightforward manner. As highlighted with the colored boxes, FreqMixFormer introduces the frequency domain and extracts high-frequency features, which often indicate subtle and dynamic movements (red), and low-frequency features, which are associated with slow and steady movements (blue). These features are then fused with spatial features. Our results demonstrate that the integrated frequency-spatial features significantly improve the model's capability to discern discriminative joint correlations.
  • Figure 2: Overview of the proposed FreqMixFormer. Given the skeleton sequence, we first perform joint and positional embedding to obtain the embedded $X$. Then $X$ is divided into $n$ unit groups ($n = 3$ in this figure), which serve as the inputs $x_i$. The data partition is explained in Section \ref{sec:overview}. Next, each $x_i$ passes through the Frequency-aware Mixed Transformer to extract the mixed frequency-spatial attention maps $MFS_i$ (defined in Section \ref{sec:FAMT}), which contain the joint fusion patterns from the frequency and spatial domains. These maps are subsequently concatenated into a feature $M$, which, along with the Value $V$, is the input of the temporal attention block for inter-frame joint correlation learning. The corresponding output $X_{out}$ is passed to a fully connected (FC) layer for classification.
  • Figure 3: Three different blocks applied in FreqMixFormer: (a) Spatial Attention Block (SAB), (b) Frequency-aware Attention Block (FAB), and (c) Temporal Attention Block (TAB).
  • Figure 4: Visualization of the attention matrices. (a) shows the joint indices of the NTU RGB+D dataset. (b) is the skeleton sequence of the "eat meal" action. The red box indicates the area of stronger attention among joints. (c) is the mixed spatial attention map extracted from the spatial attention block. (d) is the mixed frequency attention map extracted from the frequency-aware attention block. (e) is the mixed frequency-spatial attention map, representing the mixed frequency-spatial skeleton features.
  • Figure 5: Accuracy comparison results on confusing actions in the hard set.
  • ...and 7 more figures
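
The data flow described in the Figure 2 caption (partition the embedded $X$ into $n$ unit groups, compute a joint-to-joint map per group, then concatenate) can be sketched with plain shape bookkeeping. This is a rough sketch under assumptions, not the paper's model: the partition is taken along the channel axis, each group's own features stand in for the Value $V$, a plain softmax attention replaces the mixed frequency-spatial map $MFS_i$, and all function names are hypothetical.

```python
import math

def split_groups(x, n):
    """Split a J x C feature matrix into n unit groups of shape J x (C // n)."""
    c = len(x[0])
    assert c % n == 0, "channel count must be divisible by n"
    w = c // n
    return [[row[i * w:(i + 1) * w] for row in x] for i in range(n)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def joint_attention(g):
    """J x J row-softmax attention among joints within one group (stand-in for MFS_i)."""
    d = len(g[0])
    return [softmax([sum(a * b for a, b in zip(gi, gj)) / math.sqrt(d)
                     for gj in g]) for gi in g]

def apply_map(a, v):
    """Multiply a J x J attention map with J x w values."""
    return [[sum(a[i][j] * v[j][c] for j in range(len(v)))
             for c in range(len(v[0]))] for i in range(len(a))]

def concat_channels(groups):
    """Concatenate per-group J x w matrices back into J x C."""
    return [sum((g[j] for g in groups), []) for j in range(len(groups[0]))]

def mixed_forward(x, n=3):
    groups = split_groups(x, n)
    # Assumption: each group's own features play the role of the Value V.
    outs = [apply_map(joint_attention(g), g) for g in groups]
    return concat_channels(outs)

# Toy embedded input (assumption): J = 4 joints, C = 6 channels, n = 3 groups.
x = [[0.1 * (j + 1) * (c + 1) for c in range(6)] for j in range(4)]
out = mixed_forward(x, n=3)
print(len(out), len(out[0]))
```

The concatenated output keeps the J x C shape of the input, so a later block (the temporal attention block in the caption) could consume it without reshaping.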