
MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

Naichuan Zheng, Hailun Xia, Zeyu Liang, Yuchen Du

TL;DR

MK-SGN tackles the energy bottleneck of skeleton-based action recognition by marrying Spiking Neural Networks with Graph Convolutional Networks through Multimodal Fusion and Knowledge Distillation. The method encodes skeleton streams into spike form with the Skeleton Spiking Coding (SSC) module, fuses four modalities via Spike-based Multimodal Fusion (SMF) guided by mutual information, and processes the fused spikes with a Self-Attention Spiking Graph Convolution (SA-SGC) and Spiking Temporal Convolution (STC). A GCN-to-SGN distillation pipeline, including both inner-layer feature distillation via a Feature Translation Module (FTM) and soft-label distillation, transfers rich multimodal knowledge from a 10-layer GCN teacher to a 6-layer SGN student. Empirically, MK-SGN achieves substantial energy savings (approximately $0.596\ \text{mJ}$ per sample, a reduction of over $98\%$ vs. strong ANN baselines) while delivering competitive accuracy across NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA datasets (e.g., $78.5\%$ XSub and $85.6\%$ XView on NTU-RGB+D; $92.3\%$ on NW-UCLA), establishing a new energy–accuracy baseline for skeleton-based action recognition on neuromorphic platforms.

Abstract

In recent years, multimodal Graph Convolutional Networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, the reliance of GCN-based methods on energy-intensive continuous floating-point operations poses significant challenges for deployment on energy-constrained, battery-powered edge devices. To address these limitations, MK-SGN, a Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation, is proposed; it is the first method to leverage the energy efficiency of Spiking Neural Networks (SNNs) for skeleton-based action recognition. By integrating the energy-saving properties of SNNs with the graph representation capabilities of GCNs, MK-SGN achieves significant reductions in energy consumption while maintaining competitive recognition accuracy. Firstly, we formulate a Spiking Multimodal Fusion (SMF) module to effectively fuse multimodal skeleton data represented as spike-form features. Secondly, we propose the Self-Attention Spiking Graph Convolution (SA-SGC) module and the Spiking Temporal Convolution (STC) module to capture the spatial relationships and temporal dynamics of spike-form features. Finally, we propose an integrated knowledge distillation strategy to transfer information from the multimodal GCN to the SGN, incorporating both intermediate-layer distillation and soft-label distillation to enhance the performance of the SGN. MK-SGN surpasses state-of-the-art GCN frameworks in energy efficiency and outperforms state-of-the-art SNN frameworks in recognition accuracy. The proposed method reduces energy consumption by more than 98% compared to conventional GCN-based approaches. This research establishes a robust baseline for developing high-performance, energy-efficient SNN-based models for skeleton-based action recognition.
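The soft-label half of the distillation strategy can be illustrated with a small sketch. The abstract does not give the loss formula, so the temperature-scaled KL divergence below is the commonly used form of soft-label distillation and is an assumption here, not the authors' stated loss; the function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed form of soft-label distillation; the paper's exact loss
    is not given in this summary.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [1.0, 2.0, 3.0]
matched_student = [1.0, 2.0, 3.0]
mismatched_student = [3.0, 2.0, 1.0]
print(soft_label_kd_loss(matched_student, teacher))
print(soft_label_kd_loss(mismatched_student, teacher))
```

The loss vanishes when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets the GCN teacher's soft labels supervise the SGN student.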

Paper Structure (40 sections, 38 equations, 8 figures, 12 tables)


Figures (8)

  • Figure 1: Transformation from skeleton to spikes. The former GCN performs feature extraction and information propagation on nodes of the graph structure through graph convolutional layers, while the latter SGN employs spike encoding to convert features into discrete spike-form, conducting feature extraction and information propagation in the time dimension of the spike-form feature.
  • Figure 2: Overview of the proposed MK-SGN architecture. The teacher network comprises a 10-layer GCN that processes the four skeleton modalities separately, followed by Global Average Pooling (GAP) and Fully Connected (FC) layers to generate soft labels. The student network is a 6-layer SGN that incorporates the following key modules: (a) The Skeleton Spiking Coding (SSC) module transforms raw skeleton data into spike-form features to ensure compatibility with spiking neural computations. (b) The Self-Attention Spiking Graph Convolution (SA-SGC) module models spatial dependencies and enhances feature extraction by integrating Spiking Graph Convolution and Spiking Self-Attention (SSA). (c) The Spiking Temporal Convolution (STC) module captures temporal patterns and refines temporal dynamics within spike-form features. Additionally, the Spike-based Multimodal Fusion Module (SMF) optimizes cross-modal integration based on mutual information, and the Feature Transformation Module (FTM) maps inter-layer GCN features into spike-form features for effective alignment with the SGN. Knowledge distillation is achieved through the FTM and supervision from the soft labels produced by the teacher network.
  • Figure 3: Multimodal Skeleton Spiking Coding and Mutual Information Weight Matrix Calculation Process
  • Figure 4: GCN-to-SGN Knowledge Distillation Method
  • Figure 5: Spiking attention maps for different actions. The x-axis represents NTU 25 joints, and the y-axis corresponds to time steps $T$. The grayscale intensity indicates the activation level of the attention mechanism at each joint over time.
  • ...and 3 more figures
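
The skeleton-to-spike transformation described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. The captions do not say which neuron model the SSC module uses, so the leaky integrate-and-fire (LIF) neuron below, a common choice in spiking networks, is an assumption; `lif_encode` and all of its parameters are hypothetical names chosen for illustration.

```python
def lif_encode(current, time_steps=4, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Turn a constant input value into a binary spike train.

    Assumed leaky integrate-and-fire dynamics; not taken from the paper.
    """
    v = v_reset
    spikes = []
    for _ in range(time_steps):
        # Leaky integration: the membrane potential moves toward the input.
        v = v + (current - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # fire a spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# A stronger input feature fires more often within the same time window.
for value in (0.0, 1.5, 2.0):
    print(value, lif_encode(value))
```

Downstream layers then operate on these binary trains along the time-step axis $T$, the same axis shown in the attention maps of Figure 5.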