Enhancing Action Recognition from Low-Quality Skeleton Data via Part-Level Knowledge Distillation

Cuiwei Liu, Youzhi Jiang, Chong Du, Zhaokui Li

TL;DR

The paper tackles action recognition from noisy, low-quality skeleton data with a general teacher–student knowledge-distillation framework that transfers fine-grained, part-level knowledge from high-quality skeletons. It combines a part-based skeleton matching strategy, an action-specific high-efficiency part matrix, and a part-level multi-sample contrastive loss to align part representations across heterogeneous pose graphs, even for low-quality samples that have no matching high-quality counterpart. The approach yields consistent improvements over strong baselines and state-of-the-art methods on NTU-RGB+D, Penn Action, and SYSU 3D HOI, demonstrating robustness to occlusion and missing joints in real-world settings. These results highlight its practical value for robust, real-time action recognition when pose information is limited or degraded.

Abstract

Skeleton-based action recognition is vital for comprehending human-centric videos and has applications in diverse domains. One of the challenges of skeleton-based action recognition is dealing with low-quality data, such as skeletons that have missing or inaccurate joints. This paper addresses the issue of enhancing action recognition using low-quality skeletons through a general knowledge distillation framework. The proposed framework employs a teacher-student model setup, where a teacher model trained on high-quality skeletons guides the learning of a student model that handles low-quality skeletons. To bridge the gap between heterogeneous high-quality and low-quality skeletons, we present a novel part-based skeleton matching strategy, which exploits shared body parts to facilitate local action pattern learning. An action-specific part matrix is developed to emphasize critical parts for different actions, enabling the student model to distill discriminative part-level knowledge. A novel part-level multi-sample contrastive loss achieves knowledge transfer from multiple high-quality skeletons to low-quality ones, which enables the proposed knowledge distillation framework to train on low-quality skeletons that lack corresponding high-quality matches. Comprehensive experiments conducted on the NTU-RGB+D, Penn Action, and SYSU 3D HOI datasets demonstrate the effectiveness of the proposed knowledge distillation framework.
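
To make the part-level multi-sample contrastive idea more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a supervised-contrastive (InfoNCE-style) form: for each body part, a low-quality (student) feature is pulled toward every high-quality (teacher) feature of the same action class and pushed away from the others. The function name, the temperature `tau`, the optional `part_weights` argument (standing in for per-action rows of an action-specific part matrix), and all array shapes are hypothetical.

```python
import numpy as np

def part_multi_sample_contrastive_loss(student, teacher, s_labels, t_labels,
                                       part_weights=None, tau=0.1):
    """Hypothetical part-level, multi-positive contrastive loss.

    student:      (Ns, P, D) part features of low-quality skeletons
    teacher:      (Nt, P, D) part features of high-quality skeletons
    s_labels:     (Ns,) action labels of the student samples
    t_labels:     (Nt,) action labels of the teacher samples
    part_weights: (Ns, P) per-sample part importance, or None for uniform
    """
    student = np.asarray(student, dtype=np.float64)
    teacher = np.asarray(teacher, dtype=np.float64)

    # L2-normalise so that dot products are cosine similarities.
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + 1e-12)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + 1e-12)

    # Per-part similarity between every student and every teacher: (P, Ns, Nt).
    logits = np.einsum("ipd,jpd->pij", s, t) / tau

    # Numerically stable log-softmax over the teacher axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # Positives are all teacher samples sharing the student's action label,
    # so no one-to-one pairing between the two sets is required.
    pos = np.asarray(s_labels)[:, None] == np.asarray(t_labels)[None, :]
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0

    # Average log-probability over the positives, per part and student.
    mean_pos = (log_prob * pos[None]).sum(axis=-1) / np.maximum(n_pos, 1)[None]
    loss_per_part = -mean_pos.T  # (Ns, P)

    if part_weights is None:
        n_parts = loss_per_part.shape[1]
        part_weights = np.full(loss_per_part.shape, 1.0 / n_parts)

    per_sample = (loss_per_part * np.asarray(part_weights)).sum(axis=1)
    # Students without any same-class teacher in the batch are skipped.
    return float(per_sample[valid].mean())
```

Because positives are selected by class label rather than by sample pairing, a low-quality sample contributes to this loss whenever any high-quality sample of the same action is present, which mirrors the property the abstract highlights for unmatched low-quality skeletons.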

Paper Structure (25 sections, 13 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Visualization of skeleton data. (a) 3D poses generated by Microsoft Kinect V.2. (b) 2D poses estimated by PifPaf [kreiss2019pifpaf] with MobileNetv3Small [howard2019searching] as the backbone. (c) 2D poses estimated by PifPaf with ShuffleNetv2x1 [ma2018shufflenet] as the backbone. (d) 2D poses estimated by PifPaf with ResNet152 [he2016deep] as the backbone.
  • Figure 2: Visualization of two skeleton graphs. (a) The skeleton graph produced by Microsoft Kinect V.2 contains 25 joints. (b) The skeleton graph obtained by the PifPaf pose estimator [kreiss2019pifpaf] contains 17 joints.
  • Figure 3: Training pipeline of the proposed knowledge distillation framework for skeleton-based action recognition. It includes two processes: pre-training of the teacher network (dashed lines) and training of the student network (solid lines).
  • Figure 4: Architecture of the GCN blocks built upon ST-GCN [yan2018spatial].
  • Figure 5: Visualization of the action-specific high-efficiency part matrix on the NTU-RGB+D dataset. Large values are shown in dark blue, while small values are shown in pale shades. This figure is best seen in color.
  • ...and 2 more figures
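
Figure 2 contrasts a 25-joint Kinect V.2 graph with a 17-joint PifPaf graph. One way such heterogeneous graphs can be compared, sketched here under stated assumptions, is to pool joint features into body parts that exist in both layouts. The five-part partition and the helper `pool_parts` are hypothetical illustrations, not the paper's definition. The joint indices assume the Kinect V2 SDK joint order (0-based) and the COCO 17-keypoint order for the PifPaf output.

```python
import numpy as np

# Hypothetical partition of the 25 Kinect V2 joints (0-based SDK order).
KINECT_V2_PARTS = {
    "head_trunk": [0, 1, 2, 3, 20],        # spine base/mid, neck, head, spine shoulder
    "left_arm":   [4, 5, 6, 7, 21, 22],    # shoulder, elbow, wrist, hand, hand tip, thumb
    "right_arm":  [8, 9, 10, 11, 23, 24],
    "left_leg":   [12, 13, 14, 15],        # hip, knee, ankle, foot
    "right_leg":  [16, 17, 18, 19],
}

# Hypothetical partition of the 17 COCO keypoints (assumed PifPaf order).
COCO17_PARTS = {
    "head_trunk": [0, 1, 2, 3, 4],         # nose, eyes, ears (COCO has no spine joints)
    "left_arm":   [5, 7, 9],               # shoulder, elbow, wrist
    "right_arm":  [6, 8, 10],
    "left_leg":   [11, 13, 15],            # hip, knee, ankle
    "right_leg":  [12, 14, 16],
}

def pool_parts(joint_feats, parts):
    """Average joint features within each part.

    joint_feats: (T, V, C) array with T frames, V joints, C channels
    parts:       dict mapping part name to a list of joint indices
    returns:     (T, P, C) array with one feature per part
    """
    joint_feats = np.asarray(joint_feats)
    pooled = [joint_feats[:, idx, :].mean(axis=1) for idx in parts.values()]
    return np.stack(pooled, axis=1)

# Usage: both skeletons end up with the same number of part nodes,
# even though their joint counts and coordinate dimensions differ.
kinect_seq = np.random.rand(30, 25, 3)   # 3D joints
pifpaf_seq = np.random.rand(30, 17, 2)   # 2D joints
kinect_parts = pool_parts(kinect_seq, KINECT_V2_PARTS)
pifpaf_parts = pool_parts(pifpaf_seq, COCO17_PARTS)
```

After pooling, both sequences have five part nodes in the same order, so part-level features from the two skeleton types can be aligned one-to-one even though the underlying joint graphs differ.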