Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained Skeleton-Based Action Recognition

Haochen Chang, Jing Chen, Yilin Li, Jixiang Chen, Xiaofeng Zhang

TL;DR

Problem: fine-grained skeleton action recognition suffers from subtle inter-class differences and high similarity among actions. Approach: a Wavelet-Attention Decoupling (WAD) module uses a 1D discrete wavelet transform to split features into $\mathbf{X}_{low}$ and $\mathbf{X}_{high}$, enabling adaptive decoupling that yields $\mathbf{X}_{salient}$ and $\mathbf{X}_{subtle}$, followed by Fine-grained Contrastive Enhancement (FCE) with trajectory-wise attention and prototype contrastive loss to sharpen subtle cues. Contributions: (i) time-frequency decoupling with WAD, (ii) trajectory-based FCE with prototype supervision, and (iii) a fusion objective combining $\mathbf{X}_{fuse}=\mathbf{X}_{salient}+\mathbf{X}_{subtle}$ and multiple losses $\mathcal{L}=\lambda_{fuse}\mathcal{L}_{fuse}+\lambda_{salient}\mathcal{L}_{salient}+\lambda_{proto}\mathcal{L}_{proto}$. Results: strong performance on NTU RGB+D and FineGYM, particularly for hard-to-distinguish actions, validating the effectiveness of frequency-domain decoupling for fine-grained skeleton-based recognition.

Abstract

Skeleton-based action recognition has attracted much attention owing to its succinctness and robustness. However, the minimal inter-class variation among similar action sequences often leads to confusion. The inherent spatiotemporal coupling of skeleton features makes it challenging to mine the subtle differences in joint motion trajectories, which are critical for distinguishing confusing fine-grained actions. To alleviate this problem, we propose a Wavelet-Attention Decoupling (WAD) module that uses the discrete wavelet transform to disentangle salient and subtle motion features in the time-frequency domain. A decoupling attention then adaptively recalibrates their temporal responses. To further amplify the discrepancies in the subtle motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module that strengthens attention to trajectory features through contrastive learning. Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. Our method performs competitively with state-of-the-art methods and discriminates confusing fine-grained actions well.
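To make the wavelet step concrete, the following is a minimal sketch of a single-level 1D discrete wavelet transform applied along the time axis of one joint coordinate. It is only an illustration under stated assumptions: the Haar wavelet is assumed here because the text above does not name the wavelet used, the input is a plain list rather than a feature tensor, and the names `haar_dwt_1d`, `haar_idwt_1d`, and `trajectory` are hypothetical. The low-frequency output plays the role of $\mathbf{X}_{low}$ and the high-frequency output the role of $\mathbf{X}_{high}$; the attention-based decoupling into salient and subtle features is not modeled.

```python
import math


def haar_dwt_1d(signal):
    """Single-level 1D Haar DWT over an even-length sequence.

    Returns (low, high): approximation and detail coefficients,
    each half as long as the input. Haar is an assumption of this sketch.
    """
    if len(signal) % 2 != 0:
        raise ValueError("this sketch expects an even number of frames")
    s = 1.0 / math.sqrt(2.0)
    # Scaled pairwise sums capture the slow trend (low-frequency band).
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # Scaled pairwise differences capture frame-to-frame change (high-frequency band).
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt_1d(low, high):
    """Inverse of haar_dwt_1d: interleave the reconstructed sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out


# Hypothetical trajectory: one coordinate of one joint over 8 frames,
# a slow drift plus a small frame-to-frame jitter.
trajectory = [0.0, 0.2, 1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
x_low, x_high = haar_dwt_1d(trajectory)
reconstructed = haar_idwt_1d(x_low, x_high)
```

In this toy input the drift lands in `x_low` while the constant jitter of 0.2 between paired frames lands in `x_high`, and the inverse transform recovers the original sequence, so the split itself loses no information.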

Paper Structure (11 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The salient features of fine-grained actions frequently exhibit high similarity, but the distinctions primarily manifest in the subtle features within the red box. WDCE-Net decouples the above two features in the frequency domain and focuses on enhancing subtle features. Compared with traditional methods, our method can cluster fine-grained action features better.
  • Figure 2: (a) Overview of the proposed WDCE-Net. (b) Wavelet-Attention Decoupling (WAD) module maps the original features into the time-frequency domain and decouples salient and subtle motion features. Fine-grained Contrastive Enhancement (FCE) module enhances subtle features and amplifies the differences of confusing actions.
  • Figure 3: Accuracy comparison of our method with ST-GCN, CTR-GCN and FR-Head. (a) Results on three sub-datasets. (b) Results on six easily confusing actions.
  • Figure 4: Visualization of features by t-SNE. (a)-(c) Visualization results of CTR-GCN, FR-Head, and our method. (d) Feature decoupling results of “Reading” class samples.