From Movements to Metrics: Evaluating Explainable AI Methods in Skeleton-Based Human Activity Recognition

Kimji N. Pellano; Inga Strümke; Espen Alexander F. Ihlen

TL;DR

The paper addresses the lack of validated XAI evaluation metrics for skeleton-based HAR by testing $PGI$/$PGU$ faithfulness and $RIS$/$ROS$/$RRS$ stability on CAM and Grad-CAM explanations produced by EfficientGCN on the NTU RGB+D-60 dataset. It introduces biomechanically constrained perturbations with perturbation radius $r$ (tested from $2.5$ to $80$ cm) to assess metric robustness while keeping human kinematics realistic. The key finding is that faithfulness can be unreliable for this model, whereas stability provides a more dependable measure; moreover, CAM and Grad-CAM yield nearly identical explanations, highlighting the need for more diverse XAI methods in skeleton HAR. The study underscores the practical need for developing domain-specific XAI metrics and methods to ensure trustworthy explanations in high-stakes HAR applications, and it advocates broader cross-model analyses to guide method selection.

Abstract

The advancement of deep learning in human activity recognition (HAR) using 3D skeleton data is critical for applications in healthcare, security, sports, and human-computer interaction. This paper tackles a well-known gap in the field: the applicability and reliability of XAI evaluation metrics have not been tested in the skeleton-based HAR domain. To address this, we test two established XAI metrics, faithfulness and stability, on Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM). The study also introduces a perturbation method that respects human biomechanical constraints to ensure realistic variations in human movement. Our findings indicate that faithfulness may not be a reliable metric in certain contexts, such as with the EfficientGCN model. Conversely, stability emerges as a more dependable metric under slight input perturbations. CAM and Grad-CAM are also found to produce almost identical explanations, leading to very similar XAI metric performance. This calls for more diversified metrics and new XAI methods in skeleton-based HAR.
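To make the faithfulness metrics concrete, the following is a minimal sketch of $PGI$ and $PGU$ as prediction gaps: the mean absolute change in the model output when only the top-$k$ attributed features (PGI) or only the remaining features (PGU) are perturbed. It is an illustration under stated assumptions, not the paper's implementation: `toy_predict`, the flat feature list, and the uniform per-feature noise are hypothetical stand-ins for EfficientGCN, skeleton joints, and the biomechanically constrained perturbation.

```python
import math
import random


def top_k_mask(attributions, k):
    """Boolean mask selecting the k features with the largest |attribution|."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    top = set(order[:k])
    return [i in top for i in range(len(attributions))]


def prediction_gap(predict, x, mask, radius, n_samples=50, seed=0):
    """Mean |f(x) - f(x')| where only the masked features are perturbed.

    Assumption: independent uniform noise in [-radius, radius] per feature;
    the paper instead perturbs joints under biomechanical constraints.
    """
    rng = random.Random(seed)
    base = predict(x)
    total = 0.0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-radius, radius) if m else xi
                  for xi, m in zip(x, mask)]
        total += abs(base - predict(x_pert))
    return total / n_samples


def pgi(predict, x, attributions, k, radius):
    """Prediction gap when the k most important features are perturbed."""
    return prediction_gap(predict, x, top_k_mask(attributions, k), radius)


def pgu(predict, x, attributions, k, radius):
    """Prediction gap when all features except the top k are perturbed."""
    mask = [not m for m in top_k_mask(attributions, k)]
    return prediction_gap(predict, x, mask, radius)


def toy_predict(x):
    """Hypothetical model: a sigmoid dominated by the first feature."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] + 0.1 * x[1])))


x = [0.2, 0.5, -0.3]
attributions = [3.0, 0.1, 0.0]  # assumed explanation for the toy model
pgi_value = pgi(toy_predict, x, attributions, k=1, radius=0.5)
pgu_value = pgu(toy_predict, x, attributions, k=1, radius=0.5)
```

For an explanation that ranks features correctly, `pgi_value` should be large and `pgu_value` small. Comparing both against a random attribution baseline, as the paper does, indicates whether the metric separates informative explanations from uninformative ones.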

Paper Structure (15 sections, 8 equations, 5 figures, 2 tables)

Introduction
Materials
NTU RGB+D 60 dataset and EfficientGCN
Evaluation Metrics
Faithfulness
Stability
Methods
Skeleton Data Perturbation
Calculation and Evaluation of XAI Metrics
Results
Faithfulness
Stability
Discussion and Conclusion
Class 11 Metric Values
Class 26 Metric Values

Figures (5)

Figure 1: Illustration of perturbing a point P(x, y, z) in 3D space to a new position P'(x', y', z') using spherical coordinates. The perturbation magnitude is represented by $r$, with azimuthal angle $\theta$ and polar angle $\phi$.
Figure 2: The EfficientGCN pipeline (Song et al., 2022), showing the variables for calculating faithfulness and stability. Perturbation is performed in the Data Preprocess stage.
Figure 3: Left to right: CAM, Grad-CAM, and baseline random attributions for a data instance in 'writing' (class 11), averaged for all frames and normalized. The color gradient denotes the score intensity: blue indicates 0, progressing to red which indicates a score of 1.
Figure 4: Evaluation metric outcomes for 'Writing' (Class 11, i.e. the weakest class), showing CAM (blue), Grad-CAM (orange), and the random (green) methods, with panels, in order, for PGI, PGU, RISb, RISj, RISv, ROS, and RRS. The $y$-axis measures the metric values, while the $x$-axis shows the perturbation magnitude. CAM and Grad-CAM graphs overlap due to extremely similar metric outcomes.
Figure 5: Evaluation metric outcomes for 'Jump Up' (Class 26, i.e. the strongest class), showing CAM (blue), Grad-CAM (orange), and the random (green) methods, with panels, in order, for PGI, PGU, RISb, RISj, RISv, ROS, and RRS. The $y$-axis measures the metric values, while the $x$-axis shows the perturbation magnitude. CAM and Grad-CAM graphs overlap due to extremely similar metric outcomes.
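
The point perturbation illustrated in Figure 1 can be sketched as a displacement of fixed magnitude $r$ in a direction given by the azimuthal angle $\theta$ and the polar angle $\phi$. The snippet below is a minimal sketch under assumptions: the function names are hypothetical, directions are sampled uniformly on the sphere, and the biomechanical constraints described in the paper are not modeled.

```python
import math
import random


def perturb_point(p, r, theta, phi):
    """Move p = (x, y, z) by distance r along the direction (theta, phi).

    theta is the azimuthal angle and phi the polar angle, as in Figure 1.
    """
    x, y, z = p
    return (x + r * math.sin(phi) * math.cos(theta),
            y + r * math.sin(phi) * math.sin(theta),
            z + r * math.cos(phi))


def random_perturbation(p, r, rng):
    """Perturb p by distance r in a random direction.

    Assumption: cos(phi) is drawn uniformly from [-1, 1] so that directions
    are uniform on the sphere; the paper may sample the angles differently.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    phi = math.acos(rng.uniform(-1.0, 1.0))
    return perturb_point(p, r, theta, phi)


rng = random.Random(0)
joint = (1.0, 2.0, 3.0)
# r = 0.025 corresponds to 2.5 cm only if coordinates are in meters (assumed).
moved = random_perturbation(joint, 0.025, rng)
displacement = math.dist(joint, moved)
```

Whatever direction is drawn, the perturbed point lies on a sphere of radius $r$ around the original, so `displacement` equals `r` up to floating-point error. This is what lets the perturbation magnitude serve as the $x$-axis in Figures 4 and 5.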
