Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Jinfu Liu, Chen Chen, Mengyuan Liu

TL;DR

This work addresses the limitations of skeleton-only action recognition by introducing Multi-Modality Co-Learning (MMCL), which injects complementary RGB and text information during training via multimodal LLMs while keeping inference lightweight by using skeleton data only. MMCL comprises two modules: the Feature Alignment Module (FAM), which uses contrastive learning to align RGB-derived features with skeleton representations, and the Feature Refinement Module (FRM), which uses multimodal LLMs to generate instructive text features that refine the classification scores. The training objective combines a standard classification loss with a contrastive loss and a refinement loss, which yields robust, generalizable representations and supports domain-adaptive and zero-shot recognition. On NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA, MMCL outperforms existing skeleton-based methods, and zero-shot and domain-adaptive evaluations on UTD-MHAD and SYSU-Action show that it generalizes well. The code is publicly released.
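
The three-term objective described above can be sketched as a weighted sum. This is a minimal illustration, not the authors' implementation: the weights `w_con` and `w_ref` are hypothetical, the contrastive term is passed in as a precomputed number, and the sketch assumes the refined scores act as a soft target distribution, which this summary does not specify.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_probs):
    """Cross-entropy between predicted logits and a target distribution.

    A one-hot target gives the usual classification loss; any other
    distribution gives a soft-label loss.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

def one_hot(index, num_classes):
    """Return a one-hot distribution with probability 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def total_loss(logits, label, refined_scores, contrastive_loss,
               w_con=1.0, w_ref=1.0):
    """Weighted sum of classification, contrastive, and refinement terms.

    ASSUMPTIONS: w_con and w_ref are hypothetical weights, and the
    refinement term treats softmax(refined_scores) as a soft label.
    """
    cls = cross_entropy(logits, one_hot(label, len(logits)))
    ref = cross_entropy(logits, softmax(refined_scores))
    return cls + w_con * contrastive_loss + w_ref * ref
```

Setting `w_con` or `w_ref` to zero recovers plain classification training, which makes the contribution of each auxiliary term easy to ablate.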

Abstract

Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
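
As a rough illustration of the contrastive alignment performed by the FAM, the sketch below computes an InfoNCE-style loss between paired skeleton and RGB features. The abstract only states that contrastive learning is used; the InfoNCE form, the function names, and the `temperature` value are assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(skeleton_feats, rgb_feats, temperature=0.1):
    """Average InfoNCE loss over a batch of paired features.

    Each skeleton feature is pulled toward the RGB feature of the same
    sample (same index) and pushed away from the other RGB features.
    ASSUMPTION: temperature=0.1 is a hypothetical default.
    """
    skel = [l2_normalize(v) for v in skeleton_feats]
    rgb = [l2_normalize(v) for v in rgb_feats]
    loss = 0.0
    for i, s in enumerate(skel):
        # Scaled cosine similarity of skeleton sample i to every RGB sample.
        sims = [sum(a * b for a, b in zip(s, r)) / temperature for r in rgb]
        # Negative log-softmax of the matching pair, computed stably.
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        loss += log_denominator - sims[i]
    return loss / len(skel)
```

With this form, correctly paired features give a loss near zero, while mismatched pairs are penalized roughly in proportion to `1 / temperature`.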

Paper Structure (21 sections, 10 equations, 5 figures, 10 tables)

This paper contains 21 sections, 10 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Existing methods suffer from the inherent defects of a single modality and from inefficient inference. (a) Most skeleton-based methods use only the skeleton/pose modality during both training and inference, and so inherit the inherent defects of skeletons. Note that the human pose can be divided into different skeleton modalities (e.g., joint and bone). (b) Most multimodal methods use multiple modalities during both training and inference, which requires significant inference resources and is inefficient. (c) Our multi-modality co-learning (MMCL) framework incorporates multimodal features to enhance the modeling of skeletons in the training stage and maintains efficiency in the inference stage by using only concise skeletons.
  • Figure 2: Framework of our proposed Multi-Modality Co-Learning (MMCL), which integrates multimodal features during the training stage and keeps inference efficient by using only concise skeletons. The Feature Alignment Module (FAM) extracts high-level RGB features and aligns them with global skeleton features via contrastive learning. Only the RGB features are aligned with the skeleton features, because the text features generated by LLMs carry relatively limited information compared to the RGB features. The Feature Refinement Module (FRM) uses multimodal LLMs to provide instructive text features that refine the classification scores. Here, the text instruction targets one skeletal defect, the inability to represent objects, and guides the LLMs to generate instructive features accordingly. The text instructions can be modified to target other skeletal defects.
  • Figure 3: (a) RGB images are extracted from the video and a CNN models the RGB features. (b) Examples of content generated by different multimodal LLMs. The text instructions used by MMCL are set based on skeletal defects (e.g., lack of object information and appearance details), guiding the LLMs to generate features from RGB images that are complementary to the skeleton. In our implementation, we use only BLIP (Li et al., 2022) to generate text features for training.
  • Figure 4: Our MMCL can effectively perform action recognition when faced with skeleton/pose inputs from different domains. Our MMCL employs skeleton interpolation to ensure that the number of skeleton points input to the model is consistent.
  • Figure 5: Accuracy improvements on difficult action samples when CTR-GCN is trained with MMCL. The second column shows each model's prediction for the visualized sample, and the third column shows the accuracy over all samples in that sample's category. We selected four difficult action samples on which CTR-GCN is prone to prediction errors; all belong to categories that involve objects or depend heavily on the hands, where objects are difficult to distinguish from skeleton diagrams. MMCL sets text instructions based on skeletal defects and generates instructive features through LLMs to guide the model to focus on modeling human hands and objects, which leads to significant accuracy improvements on these difficult samples.
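
The skeleton interpolation mentioned in the Figure 4 caption can be pictured with the following sketch, which expands a skeleton with fewer joints into a fixed target layout. The caption does not say how the interpolation is done; averaging neighbouring joints, the `recipe` format, and the three-joint example are all assumptions made for illustration.

```python
def interpolate_skeleton(joints, recipe):
    """Build a target skeleton from a source skeleton.

    joints: list of (x, y, z) tuples in the source layout.
    recipe: one entry per target joint; each entry is a tuple of source
        joint indices whose coordinates are averaged. A single index
        copies the joint, two indices give their midpoint.
    ASSUMPTION: the recipe format is hypothetical, not taken from the paper.
    """
    target = []
    for sources in recipe:
        picked = [joints[i] for i in sources]
        # Average each coordinate axis over the selected source joints.
        target.append(tuple(sum(axis) / len(picked) for axis in zip(*picked)))
    return target

# Hypothetical 3-joint arm (shoulder, elbow, wrist) expanded to 5 joints
# by inserting midpoints between neighbouring joints.
source = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0)]
recipe = [(0,), (0, 1), (1,), (1, 2), (2,)]
target = interpolate_skeleton(source, recipe)
```

Because the recipe fixes the number and order of target joints, skeletons from datasets with different joint layouts can be mapped to the same input shape before being fed to one model.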