
PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition

Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

TL;DR

This work proposes Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition, providing both strong predictive performance and human-understandable insights into the model's reasoning process.

Abstract

Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
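
The concept-construction step described above (cut pose sequences into overlapping windows, cluster them, and keep an actual sample as each cluster's representative) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The source does not name the clustering algorithm, so a tiny k-means stands in for it. The function names, the stride, the joint count, and the random toy data are all hypothetical.

```python
import numpy as np

def subsample_windows(poses, T, stride):
    """Split a (num_frames, J, 2) pose sequence into overlapping windows of length T."""
    starts = range(0, poses.shape[0] - T + 1, stride)
    return np.stack([poses[s:s + T] for s in starts])

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means stand-in (assumption: the source does not specify the algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def medoid_index(X, members):
    """Index into X of the member with the smallest total distance to its cluster."""
    sub = X[members]
    pairwise = np.linalg.norm(sub[:, None] - sub[None], axis=-1)
    return members[pairwise.sum(axis=1).argmin()]

# Toy data: one 40-frame "video" with 13 two-dimensional joints (hypothetical shapes).
rng = np.random.default_rng(0)
video_poses = rng.normal(size=(40, 13, 2))

windows = subsample_windows(video_poses, T=8, stride=4)  # dynamic-concept candidates
X = windows.reshape(len(windows), -1)                    # flatten each window
labels = kmeans(X, k=3)
medoids = {int(j): int(medoid_index(X, np.flatnonzero(labels == j)))
           for j in np.unique(labels)}
```

Setting `T=1` in this sketch would produce single-frame windows, which corresponds loosely to the static pose concepts. Because each medoid is a real window from the data, it can be rendered directly as a skeleton clip for inspection.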

Paper Structure

This paper contains 39 sections, 11 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparison with existing XAI approaches for video action recognition. (a) Supervoxel-based concepts [vcad] cluster similar regions but lack motion dynamics. (b) Grad-CAM [gradcam] highlights spatial regions but fails to capture temporal dependencies. (c) Textual concepts [lf-cbm] provide abstract labels that miss fine-grained motion cues. (d) PCBEAR uses pose-based concepts to model spatial and temporal dynamics, offering interpretable, motion-aware explanations.
  • Figure 2: Overview of pose-based concept construction. (a) We extract a pose sequence from each video using the pose estimator $f_p(\cdot)$. (b) Since full-video pose sequences are too long to be used directly as concepts, we perform temporal sub-sampling to obtain overlapping sub-sampled pose sequences of length $T$. (c) We apply clustering to all sub-sampled pose sequences in the dataset to discover pose-based concepts. (d) The medoid of each cluster serves as the representative pose, ensuring interpretability by selecting actual samples from the dataset.
  • Figure 3: Training overview of PCBEAR. (a) Given an input clip $x_i$, we extract video features $z_i$ using the video backbone $f_v(\cdot)$, then map them to concept features $a_i$ via the learnable concept projection matrix $W_c$. The model optimizes $W_c$ using the cosine cubed loss to align $a_i$ with the assigned concept label $c_i$. (b) Given concept features $a_i$, the classifier parameterized by $W_F$ and $b_F$ predicts $\hat{y}_i$. To improve interpretability, we employ a sparsity regularization [sparselinear] in the classification layer.
  • Figure 4: Visualizations of representative pose-based concepts. (a) Dynamic pose concepts represent motion patterns across multiple frames. (b) Static pose concepts represent spatial configurations. We also show the corresponding video class for each concept.
  • Figure 5: Concept contributions. We visualize the dynamic pose concepts corresponding to the top-3 concept contributions for a sample from the "Jumping Jack" class.
  • ...and 4 more figures
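
The forward pass in the Figure 3 caption can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' code. The cosine cubed similarity follows the form commonly used in label-free concept bottleneck models (standardize each concept over the batch, cube element-wise, then take a cosine), and the exact loss in PCBEAR may differ. The L1 penalty is an assumed stand-in for the unspecified sparsity regularization. All dimensions, the random features, and the placeholder concept targets are hypothetical.

```python
import numpy as np

def cos_cubed_similarity(a, c, eps=1e-8):
    """Per-concept cos-cubed similarity between predictions a and targets c, both (batch, K)."""
    def standardize_cube(m):
        m = (m - m.mean(axis=0)) / (m.std(axis=0) + eps)  # zero mean, unit std per concept
        return m ** 3                                     # cubing emphasizes strong activations
    a3, c3 = standardize_cube(a), standardize_cube(c)
    num = (a3 * c3).sum(axis=0)
    den = np.linalg.norm(a3, axis=0) * np.linalg.norm(c3, axis=0) + eps
    return num / den

rng = np.random.default_rng(0)
B, D, K, n_classes = 16, 32, 5, 4        # hypothetical batch, feature, concept, class sizes

z = rng.normal(size=(B, D))              # stands in for backbone features z_i = f_v(x_i)
W_c = rng.normal(size=(K, D)) * 0.1      # concept projection matrix W_c
a = z @ W_c.T                            # concept features a_i, shape (B, K)

c = rng.normal(size=(B, K))              # placeholder concept targets c_i (assumption)
sim = cos_cubed_similarity(a, c)
concept_loss = -sim.mean()               # maximizing similarity aligns a_i with c_i

W_F = rng.normal(size=(n_classes, K))    # classifier weights W_F
b_F = np.zeros(n_classes)                # classifier bias b_F
logits = a @ W_F.T + b_F
y_hat = logits.argmax(axis=1)            # predicted class per clip
l1_penalty = np.abs(W_F).sum()           # assumed sparsity term on the final layer
```

With a sparse `W_F`, each class depends on only a few concepts, so the per-concept contribution to a class logit (the product of a weight in `W_F` and the matching entry of `a`) is small enough to inspect by hand.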