Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition

Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, Zhenan Sun

TL;DR

This paper tackles the challenge of distinguishing similar actions in skeleton data by introducing ProtoGCN, a graph-based approach that decomposes actions into a mixture of learnable motion prototypes. Within a GCN framework, it combines a Prototype Reconstruction Network (PRN) backed by a memory of prototypes, a Motion Topology Enhancement (MTE) module that enriches joint relationships, and a class-specific contrastive objective (CSCL) that sharpens inter-class separability. The method achieves state-of-the-art performance on NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, and ablations show that PRN, MTE, and CSCL each contribute to compact, discriminative representations. In practice, the approach improves fine-grained action recognition on real-world skeleton datasets, and the code is released for reproducibility.

Abstract

In skeleton-based action recognition, a key challenge is distinguishing between actions with similar trajectories of joints due to the lack of image-level details in skeletal representations. Recognizing that the differentiation of similar actions relies on subtle motion details in specific body parts, we direct our approach to focus on the fine-grained motion of local skeleton components. To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. By contrasting the reconstruction of prototypes, ProtoGCN can effectively identify and enhance the discriminative representation of similar actions. Without bells and whistles, ProtoGCN achieves state-of-the-art performance on multiple benchmark datasets, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, which demonstrates the effectiveness of the proposed method. The code is available at https://github.com/firework8/ProtoGCN.

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of the skeletons and learned topologies for the similar actions Writing and Typing on a Keyboard (deeper color indicates stronger relationships between corresponding joints). As shown in (a) and (c), the baseline PYSKL [duan2022pyskl] demonstrates its ability to focus on joints associated with hands, but falls short in revealing their distinctive motion characteristics. In contrast, the integration of the Graph Prototype Reconstruction mechanism facilitates a clearer differentiation between the two actions, as evidenced by the notably distinct motion patterns observed between (b) and (d). Please zoom in for a better view.
  • Figure 2: The overall architecture of ProtoGCN. A Prototype Reconstruction Network is proposed to transform the representation of graph topology $\mathbf{X}$ into a combination $\mathbf{Z}$ of learnable prototypes at the fine-grained joint level, thereby enhancing the distinctiveness of features. Specifically, the prototypes represent diverse relationship patterns between all the human joints. Additionally, at each layer of the network, the Motion Topology Enhancement module is integrated to capture rich and expressive motion representations, establishing the foundation for prototype learning. Finally, the outputs of the model are supervised by both the classification loss and the class-specific contrastive loss.
  • Figure 3: Ablation study on the influences of weight $\lambda$ and the memory capacity $n_{pro}$ under the NTU-120 X-Sub setting.
  • Figure 4: Visualization of the topologies learned by PYSKL [duan2022pyskl] and ProtoGCN across four actions. Darker color indicates stronger correlation between corresponding joints.
  • Figure 5: Action classes with an accuracy difference greater than 1% between our method and PYSKL [duan2022pyskl] on the NTU-120 dataset.
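
The Figure 2 caption describes turning the topology representation $\mathbf{X}$ into a combination $\mathbf{Z}$ of learnable prototypes held in a memory of capacity $n_{pro}$. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a softmax addressing step over a linear projection, and the names `reconstruct_from_prototypes`, `W_query`, and `memory`, as well as all shapes and the random initialization, are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def reconstruct_from_prototypes(X, memory, W_query):
    """Rebuild each row of X as a convex combination of prototypes.

    X:       (num_items, d_in)  features describing joint relationships
    memory:  (n_pro, d_proto)   prototype memory (learnable in a real model)
    W_query: (d_in, n_pro)      assumed linear addressing projection
    """
    scores = X @ W_query                 # similarity of each item to each slot
    weights = softmax(scores, axis=-1)   # non-negative, rows sum to 1
    Z = weights @ memory                 # weighted mixture of prototypes
    return Z, weights


# Illustrative sizes only: 25 joints gives 25 * 25 joint pairs.
rng = np.random.default_rng(0)
num_joints, d_in, d_proto, n_pro = 25, 16, 16, 100
X = rng.standard_normal((num_joints * num_joints, d_in))
memory = rng.standard_normal((n_pro, d_proto))
W_query = rng.standard_normal((d_in, n_pro))

Z, weights = reconstruct_from_prototypes(X, memory, W_query)
print(Z.shape, weights.shape)
```

Because the mixing weights come from a softmax, every reconstructed row lies inside the convex hull of the prototypes. In a trained model, `memory` and `W_query` would be parameters updated by backpropagation rather than fixed random arrays.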