Evolving Skeletons: Motion Dynamics in Action Recognition

Jushang Qiu, Lei Wang

TL;DR

The paper addresses how to capture dynamic motion and higher-order joint interactions in skeleton-based action recognition. It compares a skeletal-graph model (ST-GCN) with a hypergraph-based model (Hyperformer), evaluating both on original skeletons and Taylor-transformed skeletons from the NTU-60 and NTU-120 datasets. Hyperformer generally outperforms ST-GCN. Taylor-transformed skeletons improve recognition of motion-sensitive actions for graph-based models but can reduce performance on actions that rely on spatial context, which indicates a need for architectures that fuse motion and spatial cues. These findings inform the design of next-generation skeletal models and point to hybrid architectures for handling motion-rich data in action recognition.

Abstract

Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.
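
To make the idea of "motion-injected poses" concrete, the following is a minimal sketch of one simplified reading of a Taylor-style transform: each output pose is a truncated sum of temporal finite differences of the joint coordinates, with the k-th difference divided by k!. This is an illustrative assumption, not the authors' implementation. The function names (`forward_difference`, `taylor_motion`, `motion_intensity`), the number of terms, and the toy data are all hypothetical.

```python
from math import factorial, sqrt

def forward_difference(frames, order):
    """Apply the forward difference along time `order` times.

    `frames` is a list of poses; each pose is a list of joints;
    each joint is a tuple of coordinates. Each pass shortens the
    sequence by one frame.
    """
    out = frames
    for _ in range(order):
        out = [
            [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
            for f0, f1 in zip(out, out[1:])
        ]
    return out

def taylor_motion(frames, terms=2):
    """Assumed form: sum over k = 1..terms of (k-th difference) / k!."""
    n_out = len(frames) - terms
    if n_out <= 0:
        raise ValueError("need more than `terms` frames")
    diffs = [forward_difference(frames, k) for k in range(1, terms + 1)]
    result = []
    for t in range(n_out):
        pose = []
        for j, joint in enumerate(frames[0]):
            pose.append(tuple(
                sum(diffs[k - 1][t][j][d] / factorial(k)
                    for k in range(1, terms + 1))
                for d in range(len(joint))
            ))
        result.append(pose)
    return result

def motion_intensity(motion_pose):
    """Euclidean norm of each joint's motion vector in one pose."""
    return [sqrt(sum(c * c for c in joint)) for joint in motion_pose]

# Toy sequence: joint 0 accelerates along x (x = t^2), joint 1 is static.
frames = [
    [(float(t * t), 0.0, 0.0), (1.0, 1.0, 1.0)]
    for t in range(4)
]
motion = taylor_motion(frames, terms=2)
print(motion_intensity(motion[0]))  # [2.0, 0.0]
```

In this toy case the static joint receives zero intensity while the accelerating joint does not, which mirrors the qualitative contrast between static poses and motion-emphasizing poses described in the abstract.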

Paper Structure (14 sections, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Visual comparison of (Left) a static pose and (Right) a Taylor-transformed skeleton for the action "cheer up". Motion dynamics are overlaid on the original skeleton for enhanced visualization. The size of the green circles indicates motion intensity, with larger circles representing stronger motion. Taylor-transformed skeletons emphasize dominant motions and dynamic patterns, while static poses highlight the spatial arrangement of the joints. Additional visualizations are provided in the Appendix.
  • Figure 2: The evaluation pipeline explores the performance of two state-of-the-art models, ST-GCN and Hyperformer. Both models are tested using original skeleton sequences, which emphasize spatial relationships, and Taylor-transformed skeletons, which highlight motion dynamics. This dual approach enables a comprehensive analysis of how spatial and temporal features impact action recognition performance, revealing distinct strengths and limitations for each model and data representation.
  • Figure 3: Confusion matrices of ST-GCN on NTU-60 (X-Sub) using (a) original skeletons and (b) Taylor-transformed skeletons. Along the diagonal, darker colors represent higher recognition accuracy for each action class. Predictions below 5% are filtered out for clarity, with the complete confusion matrices available in the appendix. Taylor-transformed skeletons improve recognition accuracy for actions such as using a fan (with hand or paper), wearing a shoe, and touching head (headache). However, a decline in performance is observed for actions like hopping (one-foot jumping), sitting down, and touching chest (stomachache/heart pain). This drop may result from noise introduced by certain motion features, which negatively impact the model's performance. For a detailed examination, zooming in is recommended.
  • Figure 4: Confusion matrices of Hyperformer on NTU-60 (X-Sub) using (a) original skeletons and (b) Taylor-transformed skeletons. Along the diagonal, darker colors represent higher recognition accuracy for each action class. Predictions below 5% are filtered out for clarity, with the complete confusion matrices available in the appendix. Taylor-transformed skeletons improve recognition accuracy for actions such as hopping and taking off jacket. However, a decline in performance is observed for actions like pointing to something with finger. This drop may result from noise introduced by certain motion features, which negatively impact the model's performance. For a detailed examination, zooming in is recommended.
  • Figure 5: Visual comparison of (top) original skeletons and (bottom) Taylor-transformed skeletons. From left to right, the depicted actions are: "take off a shoe", "wear a shoe", "kicking something", "nausea or vomiting", "sneeze/cough", and "touch chest (stomachache/heart pain)". Taylor-transformed skeletons are overlaid on the original skeletons, with green circles indicating the intensity of motion.
  • ...and 8 more figures
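
The captions of Figures 3 and 4 mention that predictions below 5% are filtered out of the confusion matrices. A minimal sketch of such filtering is shown here, assuming that each row (true class) is normalised to rates and that every entry under the threshold is zeroed. The function name `filter_confusion`, the row-wise normalisation, and the toy counts are assumptions for illustration, not details taken from the paper.

```python
def filter_confusion(counts, threshold=0.05):
    """Row-normalise a confusion matrix and zero out small entries.

    `counts[i][j]` is the number of samples of true class i predicted
    as class j. Rows with no samples are returned as all zeros.
    """
    filtered = []
    for row in counts:
        total = sum(row)
        rates = [c / total if total else 0.0 for c in row]
        filtered.append([r if r >= threshold else 0.0 for r in rates])
    return filtered

# Toy 3-class example (hypothetical counts).
counts = [
    [90, 8, 2],
    [3, 95, 2],
    [0, 0, 0],
]
for row in filter_confusion(counts):
    print(row)
```

With the default threshold, the 2% and 3% entries in the toy matrix are dropped while the 8% confusion is kept, so only the more frequent misclassifications remain visible.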