Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

Edoardo Bianchi, Oswald Lanz

TL;DR

The paper addresses action recognition in sports by incorporating skeletal pose information into RGB-based models. It introduces Gate-Shift-Pose (GSP), which extends Gate-Shift-Fuse with two fusion strategies: early-fusion, which adds pose heatmaps as an extra input channel, and late-fusion, which combines separate RGB and pose streams through attention. On the FR-FS figure-skating dataset, GSP improves accuracy over the RGB-only baseline by up to $40\%$ with ResNet18 and $20\%$ with ResNet50, reaching $98.08\%$ with a ResNet50 backbone in early-fusion and $95.19\%$ with a ResNet18 backbone in late-fusion. The work demonstrates the value of multimodal architectures that integrate skeleton information for capturing complex motion patterns in sports, with practical implications for robust, real-time action recognition and analytics.

Abstract

This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model's capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns. Visit the project page at https://edowhite.github.io/Gate-Shift-Pose
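The early-fusion input described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian width `sigma`, the use of a per-pixel maximum to merge keypoints into one map, and the channels-last layout are all assumptions made for illustration.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, height, width, sigma=3.0):
    """Render (K, 2) pixel-space (x, y) keypoints as one (H, W) Gaussian heatmap.

    sigma and the max-merge across keypoints are illustrative assumptions.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:
        # Unnormalized 2D Gaussian that peaks at 1.0 on the keypoint.
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g.astype(np.float32))
    return heatmap

def early_fusion_input(rgb, keypoints, sigma=3.0):
    """Stack an (H, W, 3) RGB frame with its pose heatmap into (H, W, 4)."""
    h, w, _ = rgb.shape
    heatmap = keypoints_to_heatmap(keypoints, h, w, sigma)
    return np.concatenate([rgb.astype(np.float32), heatmap[..., None]], axis=-1)

# Toy data: a random 64x64 frame and 17 random keypoints (hypothetical values).
rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))
pose = rng.uniform(8, 56, size=(17, 2))
fused = early_fusion_input(frame, pose)
print(fused.shape)  # (64, 64, 4)
```

The fourth channel is what changes for the backbone: its first convolution has to accept four input channels instead of three.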

Paper Structure

This paper contains 21 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the GSP (Gate-Shift-Pose) network architecture with two fusion strategies for integrating RGB and skeletal information. Top: In the early-fusion approach, pose data is preprocessed as a Gaussian heatmap and concatenated with RGB frames, forming a four-channel input for the GSF network. Bottom: In the late-fusion approach, RGB frames and skeletal data are processed in separate streams using a GSF network and a Pose network, respectively. Normalized features from each stream are then combined in a fusion layer, followed by multi-head attention and alignment layers to integrate relevant spatio-temporal features before classification.
  • Figure 2: Example from the FR-FS dataset. Left: an RGB frame of an ice skater performing a maneuver. Right: the corresponding Gaussian heatmap highlighting keypoints for skeleton-based feature extraction.
  • Figure 3: Architecture of the pose model. The model processes a skeleton input consisting of 17 keypoints, each represented by x and y coordinates (17 x 2). The input is passed through three fully connected (FC) layers: FC1 (64 neurons), FC2 (128 neurons), and FC3 (128 neurons). ReLU activation functions are applied after the first two layers. The output is a feature embedding vector suitable for downstream tasks.
  • Figure 4: Alignment layers applied after the Multihead Attention module and before the classification layer. These layers are designed to compress and select the most relevant information for the downstream task. The structure includes two fully connected layers (FC1 with 64 neurons and FC2 with 32 neurons), each followed by Batch Normalization, ReLU activation, and Dropout to enhance generalization and robustness.
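The layer sizes given in the captions of Figures 3 and 4 can be traced with a small NumPy shape walk-through. This is a sketch under stated assumptions, not the authors' code: the weights are random and untrained, the skeleton is flattened to 34 values before FC1, the width of the features entering the alignment layers (`FUSED_DIM`) is hypothetical, and Dropout is omitted because it is the identity at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random (untrained) weights and zero bias for one fully connected layer."""
    w = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))
    return w, np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Stand-in for Batch Normalization: per-feature batch statistics, no learned scale/shift."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Pose model (Figure 3): 17 x 2 keypoints -> FC1 (64) -> FC2 (128) -> FC3 (128).
POSE_LAYERS = [linear(17 * 2, 64), linear(64, 128), linear(128, 128)]

def pose_embedding(skeleton):
    """Map a (B, 17, 2) batch of keypoints to (B, 128) embeddings."""
    x = skeleton.reshape(skeleton.shape[0], -1)  # flatten to (B, 34); an assumption
    (w1, b1), (w2, b2), (w3, b3) = POSE_LAYERS
    x = relu(x @ w1 + b1)  # ReLU after FC1
    x = relu(x @ w2 + b2)  # ReLU after FC2
    return x @ w3 + b3     # no activation after FC3

# Alignment layers (Figure 4): FC1 (64) -> FC2 (32), each with BatchNorm and ReLU.
FUSED_DIM = 256  # hypothetical width of the attention output
ALIGN_LAYERS = [linear(FUSED_DIM, 64), linear(64, 32)]

def align(features):
    """Compress (B, FUSED_DIM) fused features to (B, 32) before classification."""
    x = features
    for w, b in ALIGN_LAYERS:
        x = relu(batch_norm(x @ w + b))  # Dropout omitted (identity at inference)
    return x

emb = pose_embedding(rng.normal(size=(4, 17, 2)))
out = align(rng.normal(size=(4, FUSED_DIM)))
print(emb.shape, out.shape)  # (4, 128) (4, 32)
```

In a trained model these layers would hold learned parameters, and Batch Normalization would use running statistics at inference; the sketch only shows how the dimensions shrink from the pose embedding to the 32-dimensional vector fed to the classifier.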