FreqMixFormerV2: Lightweight Frequency-aware Mixed Transformer for Human Skeleton Action Recognition

Wenhan Wu; Pengfei Wang; Chen Chen; Aidong Lu

TL;DR

This work tackles the resource intensity of transformer-based skeleton action recognition by proposing FreqMixFormerV2, a lightweight frequency-aware transformer. It uses only 60% of the parameters of the original FreqMixFormer, with a simplified architecture that combines a High-Frequency Attention Block (HFAB), a Low-Frequency Attention Block (LFAB), and a Spatial Attention Block (SAB) with a new high-low-frequency operator, including a DCT-based frequency path. Across the NTU-60, NTU-120, and NW-UCLA benchmarks, it delivers competitive or state-of-the-art accuracy, with only a $0.8\%$ accuracy drop compared to the larger model. The approach enables efficient deployment in resource-constrained settings and demonstrates the value of explicit high/low-frequency modulation for discriminative skeletal action recognition.

Abstract

Transformer-based human skeleton action recognition has been developed for years. However, the complexity and high parameter counts of these models hinder their practical application, especially in resource-constrained environments. In this work, we propose FreqMixFormerV2, which builds upon the Frequency-aware Mixed Transformer (FreqMixFormer) for identifying subtle and discriminative actions with pioneering frequency-domain analysis. We design a lightweight architecture that maintains robust performance while significantly reducing model complexity. This is achieved through a redesigned frequency operator that optimizes high-frequency and low-frequency parameter adjustments, and a simplified frequency-aware attention module. These improvements result in a substantial reduction in model parameters, enabling efficient deployment with only a minimal sacrifice in accuracy. Comprehensive evaluations on standard datasets (NTU RGB+D, NTU RGB+D 120, and NW-UCLA) demonstrate that the proposed model achieves a superior balance between efficiency and accuracy, outperforming state-of-the-art methods with only 60% of the parameters.
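To make the idea of high-frequency and low-frequency adjustment concrete, the following is a minimal sketch and not the authors' implementation. It assumes the features form a `(joints, channels)` array, that the DCT is taken along the joint axis, and that the operators reduce to two scalar gains applied on either side of a partition index. The names `dct_matrix`, `modulate_frequencies`, `split`, `high_gain`, and `low_gain` are hypothetical and chosen only for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is the inverse."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)  # rescale the DC row so all rows have equal norm
    return basis * np.sqrt(2.0 / n)


def modulate_frequencies(x: np.ndarray, split: int,
                         high_gain: float, low_gain: float) -> np.ndarray:
    """Scale low and high DCT coefficients of x (shape: joints x channels).

    Assumption: coefficients with index < split count as "low frequency",
    the rest as "high frequency"; both gains are plain scalars.
    """
    d = dct_matrix(x.shape[0])
    coeffs = d @ x               # forward DCT along the joint axis
    coeffs[:split] *= low_gain   # adjust the low-frequency part
    coeffs[split:] *= high_gain  # adjust the high-frequency part
    return d.T @ coeffs          # inverse DCT back to the joint domain


# Illustrative input: 25 joints with 8 feature channels (random values).
rng = np.random.default_rng(0)
x = rng.standard_normal((25, 8))
y = modulate_frequencies(x, split=5, high_gain=1.5, low_gain=0.5)
print(y.shape)  # (25, 8)
```

With both gains set to 1 the function returns its input unchanged, which is a convenient sanity check. In a trained model the gains would presumably be learned or tuned rather than fixed by hand.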

Paper Structure (15 sections, 15 equations, 5 figures, 5 tables)

INTRODUCTION
METHOD
FreqMixFormerV2 vs. FreqMixFormer
Data Processing
Lightweight Frequency-aware Mixed Transformer
Mixed Spatial Attention
High-Low Frequency-aware Attention
Mixed Temporal Attention and Action Head
EXPERIMENTS
Datasets and Implementation
Comparison with the State-of-the-Art
Ablation Study
Search for the Best Partition $N$ of High and Low Frequency Coefficients
Search for the Best High-Frequency Operator $h$ and Low-Frequency Operator $\ell$
Conclusion

Figures (5)

Figure 1: Performance vs. model size on the NTU-60 [shahroudy2016ntu] X-Sub setting. Our FreqMixFormerV2 demonstrates superior performance and efficiency compared to previous transformer-based methods. Notably, our method reduces the number of parameters by nearly half compared to FreqMixFormer while still achieving state-of-the-art accuracy.
Figure 2: FreqMixFormerV2 vs. FreqMixFormer. Given the skeleton sequence, we first apply joint and positional embeddings to obtain the embedded representation, denoted as $X$. Then $X$ is processed by a high-low frequency-aware attention block (which combines the high-frequency and low-frequency attention blocks shown in Fig. 3) to extract a mixed high-low frequency attention map. A spatial attention block is also employed to produce mixed spatial attention maps. These maps are then concatenated into a feature $M$ which, along with the Value $V$, serves as the input to the temporal attention block. This process facilitates inter-frame joint correlation learning and yields the output $X_{out}$ for action classification. The difference analysis can be found in the section "FreqMixFormerV2 vs. FreqMixFormer".
Figure 3: The mixed frequency blocks in FreqMixFormerV2: (a) High-Frequency Attention Block (HFAB), (b) Low-Frequency Attention Block (LFAB). The common blocks applied in FreqMixFormer and FreqMixFormerV2 are (c) Spatial Attention Block and (d) Temporal Attention Block.
Figure 4: Accuracy comparison results on confusing actions in the hard set.
Figure 5: Accuracy comparison results on confusing actions in the medium set.
