From Movements to Metrics: Evaluating Explainable AI Methods in Skeleton-Based Human Activity Recognition
Kimji N. Pellano, Inga Strümke, Espen Alexander F. Ihlen
TL;DR
The paper addresses the lack of validated XAI evaluation metrics for skeleton-based HAR by testing $PGI$/$PGU$ faithfulness and $RIS$/$ROS$/$RRS$ stability on CAM and Grad-CAM explanations produced by EfficientGCN on the NTU RGB+D-60 dataset. It introduces biomechanically constrained perturbations with perturbation radius $r$ (tested from $2.5$ to $80$ cm) to assess metric robustness while keeping human kinematics realistic. The key finding is that faithfulness can be unreliable for this model, whereas stability provides a more dependable measure; moreover, CAM and Grad-CAM yield nearly identical explanations, highlighting the need for more diverse XAI methods in skeleton HAR. The study underscores the practical need for developing domain-specific XAI metrics and methods to ensure trustworthy explanations in high-stakes HAR applications, and it advocates broader cross-model analyses to guide method selection.
Abstract
The advancement of deep learning in human activity recognition (HAR) using 3D skeleton data is critical for applications in healthcare, security, sports, and human-computer interaction. This paper tackles a well-known gap in the field: the applicability and reliability of XAI evaluation metrics have not been tested in the skeleton-based HAR domain. To address this, we test two established XAI metrics, faithfulness and stability, on Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM). The study also introduces a perturbation method that respects human biomechanical constraints to ensure realistic variations in human movement. Our findings indicate that faithfulness may not be a reliable metric in certain contexts, such as with the EfficientGCN model. Conversely, stability emerges as a more dependable metric under slight input perturbations. CAM and Grad-CAM are also found to produce almost identical explanations, leading to very similar XAI metric performance. This calls for more diversified metrics and new XAI methods in skeleton-based HAR.
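
To make the idea of a stability score concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a relative-input-stability form: the relative change in the explanation divided by the relative change in the input, for a single perturbed sample. The function names, the `eps` guard against division by zero, and the unconstrained sphere sampler are assumptions made for this illustration. In particular, the sampler only bounds each joint offset by a radius and does not enforce the biomechanical constraints described above.

```python
import numpy as np


def sample_offsets_in_sphere(rng, n_joints, radius_cm):
    """Draw one 3D offset per joint, uniformly inside a sphere of radius_cm.

    Illustrative stand-in only: it does NOT preserve bone lengths or joint limits.
    """
    directions = rng.normal(size=(n_joints, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The cube root spreads points uniformly over the volume of the sphere.
    radii = radius_cm * rng.uniform(size=(n_joints, 1)) ** (1.0 / 3.0)
    return directions * radii


def relative_input_stability(x, x_pert, e, e_pert, eps=1e-6):
    """Relative change in explanation divided by relative change in input.

    x, x_pert: original and perturbed inputs (any shape, flattened here).
    e, e_pert: explanations (attribution scores) for x and x_pert.
    eps: assumed small constant that keeps both divisions finite.
    """
    x, x_pert, e, e_pert = (
        np.asarray(a, dtype=float).ravel() for a in (x, x_pert, e, e_pert)
    )
    expl_change = np.linalg.norm((e - e_pert) / (np.abs(e) + eps))
    input_change = np.linalg.norm((x - x_pert) / (np.abs(x) + eps))
    return expl_change / max(input_change, eps)
```

In use, one would add the sampled offsets to the joint coordinates of a skeleton frame (25 joints per body in NTU RGB+D), recompute the explanation for the perturbed input, and take the largest score over many perturbations that leave the predicted class unchanged. Lower values indicate a more stable explanation.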
