Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

Yiping Wei, Kunyu Peng, Alina Roitberg, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen

TL;DR

This work tackles the challenge of transferring knowledge across inconsistent skeleton modalities in self-supervised action recognition. It introduces an Implicit Knowledge Exchange Module (IKEM) to enable cross-modality information sharing without relying on additional positives, adds accelerations, rotation-axis directions, and angular velocities as new modalities, and employs a MoCo-based relational cross-modality knowledge distillation to transfer knowledge from a six-modality teacher to a three-modality student. The method improves over prior baselines on NTU-60, with six-modality fusion achieving notable gains and the student model maintaining efficiency while preserving accuracy. The proposed framework demonstrates that careful management of modality interactions and distillation can unlock efficient, scalable use of skeleton-based multi-modality data for self-supervised action recognition, with public code forthcoming.

Abstract

Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most existing works are based on skeleton data and use a multi-modality setup. These works overlook the differences in performance among modalities, which leads to the propagation of erroneous knowledge between modalities. Moreover, they use only three fundamental modalities, i.e., joints, bones, and motions, and explore no additional ones. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM), which alleviates the propagation of erroneous knowledge between low-performance modalities. We then propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework, named relational cross-modality knowledge distillation, which distills knowledge from the secondary modalities into the mandatory modalities while considering the relationships constrained by anchors, positives, and negatives. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.
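
To make the notion of skeleton "modalities" concrete, the following is a minimal sketch of how bone, motion, and acceleration streams are commonly derived from raw joint coordinates. It is not taken from the paper: the function names, the `parents` list, the zero-padding, and the use of a second-order temporal difference for accelerations are all assumptions made for illustration. The rotation-axis and angular-velocity modalities are left out because their definitions are not given in this summary.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]
Frame = List[Vec]        # one 3D position per joint
Skeleton = List[Frame]   # frames ordered in time


def sub(a: Vec, b: Vec) -> Vec:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def bones(seq: Skeleton, parents: Sequence[int]) -> Skeleton:
    """Bone stream: each joint minus its parent (hypothetical parent list).
    A root joint that is its own parent yields a zero vector."""
    return [[sub(f[j], f[parents[j]]) for j in range(len(f))] for f in seq]


def diff(seq: Skeleton) -> Skeleton:
    """First-order temporal difference; the result is one frame shorter."""
    return [[sub(nxt[j], cur[j]) for j in range(len(cur))]
            for cur, nxt in zip(seq, seq[1:])]


def pad_to(seq: Skeleton, length: int, num_joints: int) -> Skeleton:
    """Append zero frames so that every stream keeps the original length."""
    missing = length - len(seq)
    return seq + [[(0.0, 0.0, 0.0)] * num_joints for _ in range(missing)]


def motions(seq: Skeleton) -> Skeleton:
    """Motion stream: frame-to-frame displacement of every joint."""
    return pad_to(diff(seq), len(seq), len(seq[0]))


def accelerations(seq: Skeleton) -> Skeleton:
    """Acceleration stream, assumed here to be a second-order difference."""
    return pad_to(diff(diff(seq)), len(seq), len(seq[0]))


# Toy 3-joint chain over 4 frames; joint 2 moves along y as t squared.
parents = [0, 0, 1]
seq = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, float(t * t), 0.0)]
       for t in range(4)]

print(bones(seq, parents)[2][2])   # (0.0, 4.0, 0.0)
print(motions(seq)[1][2])          # (0.0, 3.0, 0.0)
print(accelerations(seq)[0][2])    # (0.0, 2.0, 0.0)
```

Each derived stream has the same shape as the joint stream, so in a multi-stream setup the same encoder architecture could be applied to every modality.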

Paper Structure

This paper contains 10 sections, 8 equations, 1 figure, and 1 table.

Figures (1)

  • Figure 1: An overview of our pre-training model (red dashed box) and teacher-student model (blue dashed box). Module (a) is the knowledge exchange module in CrosSCLR, module (b) is our proposed IKEM, and module (c) is the knowledge distillation module for our teacher-student model. $\widetilde{g}$ denotes the new MLPs introduced by IKEM, and $\widetilde{g}'$ denotes the MLPs of the student model. MB stands for memory bank. All modules in the figure use the update of the joint-modality encoder as an example.
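
The relational distillation idea in module (c) can be illustrated with a generic sketch. It assumes that the "relationship" is a softmax over an anchor's similarities to its positive and to memory-bank negatives, and that the student is trained to match the teacher's distribution through a KL divergence. This is one common form of relational distillation, not necessarily the loss used in the paper; the function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def relation_distribution(anchor: Sequence[float],
                          positive: Sequence[float],
                          negatives: Sequence[Sequence[float]],
                          temperature: float = 0.1) -> List[float]:
    """Softmax over the anchor's cosine similarity to its positive (index 0)
    and to each negative drawn from a memory bank."""
    a = l2_normalize(anchor)
    keys = [l2_normalize(k) for k in [positive, *negatives]]
    logits = [dot(a, k) / temperature for k in keys]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relational_kd_loss(teacher: Sequence[float],
                       student: Sequence[float],
                       eps: float = 1e-12) -> float:
    """KL(teacher || student) between two relation distributions."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(teacher, student))


# Toy 2-D embeddings: teacher and student share the anchor and negatives,
# but the student's positive is less well aligned with the anchor.
negatives = [[0.0, 1.0], [-1.0, 0.0]]
teacher = relation_distribution([1.0, 0.0], [1.0, 0.0], negatives)
student = relation_distribution([1.0, 0.0], [0.8, 0.6], negatives)

print(relational_kd_loss(teacher, teacher))  # approximately 0
print(relational_kd_loss(teacher, student))  # small positive value
```

Under these assumptions, the loss vanishes when the student reproduces the teacher's anchor-to-key relations and grows as the two distributions diverge.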