MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition
Naichuan Zheng, Hailun Xia, Zeyu Liang, Yuchen Du
TL;DR
MK-SGN tackles the energy bottleneck of skeleton-based action recognition by combining Spiking Neural Networks with Graph Convolutional Networks through Multimodal Fusion and Knowledge Distillation. The method encodes skeleton streams into spike-form features with the Skeleton Spiking Coding (SSC) module, fuses four modalities via Spiking Multimodal Fusion (SMF) guided by mutual information, and processes the fused spikes with a Self-Attention Spiking Graph Convolution (SA-SGC) and a Spiking Temporal Convolution (STC). A GCN-to-SGN distillation pipeline, which includes both intermediate-layer feature distillation via a Feature Translation Module (FTM) and soft-label distillation, transfers rich multimodal knowledge from a 10-layer GCN teacher to a 6-layer SGN student. Empirically, MK-SGN achieves substantial energy savings (approximately $0.596\ \text{mJ}$ per sample, a reduction of over $98\%$ versus strong ANN baselines) while delivering competitive accuracy across the NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA datasets (e.g., $78.5\%$ XSub and $85.6\%$ XView on NTU-RGB+D; $92.3\%$ on NW-UCLA), establishing a new energy–accuracy baseline for skeleton-based action recognition on neuromorphic platforms.
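To make "spike-form" concrete, the following is a minimal sketch of a generic leaky integrate-and-fire (LIF) neuron that turns a real-valued input sequence into binary spikes. It is not the SSC module itself, whose details are not given in this summary; the function name `lif_encode`, the update rule, and the parameters `tau`, `v_threshold`, and `v_reset` are illustrative assumptions.

```python
def lif_encode(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Encode a sequence of real values as 0/1 spikes with one LIF neuron.

    Hypothetical illustration: a generic LIF update with hard reset,
    not the paper's Skeleton Spiking Coding module.
    """
    v = v_reset  # membrane potential
    spikes = []
    for x in inputs:
        # Leaky integration: the potential moves toward the input,
        # with time constant tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # fire a spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)  # stay silent
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 crosses the threshold on every second step.
    print(lif_encode([1.5, 1.5, 1.5]))
```

Because downstream layers receive only 0/1 events, multiplications can in principle be replaced by sparse accumulations, which is the usual source of the energy savings attributed to SNNs.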
Abstract
In recent years, multimodal Graph Convolutional Networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, GCN-based methods rely on energy-intensive continuous floating-point operations, which poses significant challenges for deployment on energy-constrained, battery-powered edge devices. To address this limitation, we propose MK-SGN, a Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation, which is the first to leverage the energy efficiency of Spiking Neural Networks (SNNs) for skeleton-based action recognition. By integrating the energy-saving properties of SNNs with the graph representation capabilities of GCNs, MK-SGN significantly reduces energy consumption while maintaining competitive recognition accuracy. Firstly, we formulate a Spiking Multimodal Fusion (SMF) module to effectively fuse multimodal skeleton data represented as spike-form features. Secondly, we propose the Self-Attention Spiking Graph Convolution (SA-SGC) module and the Spiking Temporal Convolution (STC) module to capture the spatial relationships and temporal dynamics of spike-form features. Finally, we propose an integrated knowledge distillation strategy that transfers information from the multimodal GCN to the SGN, combining intermediate-layer distillation and soft-label distillation to enhance the performance of the SGN. MK-SGN surpasses state-of-the-art GCN frameworks in energy efficiency and outperforms state-of-the-art SNN frameworks in recognition accuracy. The proposed method reduces energy consumption by more than 98% compared with conventional GCN-based approaches. This research establishes a robust baseline for developing high-performance, energy-efficient SNN-based models for skeleton-based action recognition.
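The soft-label part of the distillation strategy can be illustrated with the standard temperature-scaled formulation, in which the student is trained to match the teacher's softened class distribution. The sketch below is an assumption-laden illustration, not the authors' loss: the function names, the KL-divergence form, the `temperature ** 2` scaling, and the default temperature of 4.0 are all hypothetical choices, since the abstract does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical illustration of soft-label distillation; the paper's
    exact loss and temperature are not stated in this section.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor is a common convention that keeps gradient
    # magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]  # made-up GCN teacher logits
    student = [2.0, 1.5, 0.5]  # made-up SGN student logits
    print(soft_label_kd_loss(student, teacher))
```

In a full training setup this term would typically be added to the ordinary classification loss and to a feature-level term for the intermediate-layer distillation, with weights chosen empirically.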
