Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, Hong Cheng

TL;DR

A novel skeleton-based training framework built on cross-modal contrastive learning. It uses progressive distillation to learn task-agnostic human skeleton action representations from vision-language knowledge prompts. These prompts are generated by pre-trained large multimodal models (LMMs) and establish the vision-language action concept space.

Abstract

Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences. Existing approaches fall into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide the degree to which vision-language knowledge prompts and their corresponding skeletons are pulled closer. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
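
The idea of softened targets in a cross-modal contrastive loss can be illustrated with a minimal sketch. This is not the authors' implementation: the blending weight `alpha`, the temperature, and all function names are hypothetical, and only an intra-modal self-similarity term is shown (the inter-modal cross-consistency term is omitted). With `alpha = 0` the loss reduces to ordinary InfoNCE-style instance discrimination with one-hot targets; a larger `alpha` shifts part of the target mass toward prompts that resemble the paired prompt.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(row, temperature=1.0):
    # Numerically stable softmax over one row of scores.
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_matrix(a, b):
    # Cosine similarity between every vector in a and every vector in b.
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def soft_targets(prompt_feats, alpha, temperature=0.1):
    # Assumption: blend one-hot targets with the prompts' own
    # self-similarity distribution, weighted by a hypothetical alpha.
    self_sim = similarity_matrix(prompt_feats, prompt_feats)
    n = len(prompt_feats)
    targets = []
    for i in range(n):
        soft = softmax(self_sim[i], temperature)
        targets.append([(1.0 - alpha) * (1.0 if i == j else 0.0) + alpha * soft[j]
                        for j in range(n)])
    return targets

def soft_contrastive_loss(skel_feats, prompt_feats, alpha=0.0, temperature=0.1):
    # Cross-entropy between softened targets and skeleton-to-prompt similarities.
    sim = similarity_matrix(skel_feats, prompt_feats)
    targets = soft_targets(prompt_feats, alpha, temperature)
    loss = 0.0
    for i in range(len(sim)):
        probs = softmax(sim[i], temperature)
        loss -= sum(t * math.log(p + 1e-12) for t, p in zip(targets[i], probs))
    return loss / len(sim)
```

One way to read "progressive" in this sketch is to raise `alpha` from 0 over the course of training, so that early updates rely on hard pairings and later updates tolerate noisy pairs; that schedule is an assumption here, not something the abstract specifies.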

Paper Structure (35 sections, 13 equations, 8 figures, 8 tables)

This paper contains 35 sections, 13 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Our proposed approach (C$^2$VL) utilizes the vision-language knowledge prompts as supervision to learn task-agnostic 3D human action representations.
  • Figure 2: The pipeline of our proposed approach. Before the pre-training phase, vision and language knowledge prompts are generated for the skeleton sequences by offline LMMs (Grounding DINO [liu2023grounding] and LLaVA [liu2023llava]) using text prompts and visual questions. In the pre-training phase, the skeleton data is fed to the skeleton encoder to learn action representations in the skeleton action space. The vision encoder and language encoder extract features from the vision and language knowledge prompts, creating the vision-language action concept space, which supplies fine-grained details not captured in the skeleton space. The degree to which each pair of vision-language knowledge prompts and its corresponding skeleton is brought closer is then progressively guided by the intra-modal self-similarity and inter-modal cross-consistency softened targets. During the inference phase, only the pre-trained skeleton encoder, a fully connected layer, and skeleton data are used for skeleton-based action recognition, without vision-language knowledge prompts.
  • Figure 3: The establishment of vision-language action concept space.
  • Figure 4: Histogram of similarity scores for positive pairs between skeleton and vision-language knowledge representations in cross-modal space after training with original InfoNCE loss.
  • Figure 5: Histogram of similarity scores for negative pairs between skeleton and vision-language knowledge representations in cross-modal space after training with original InfoNCE loss. The red ellipse highlights the movement of arms.
  • ...and 3 more figures
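
The inference path described in the Figure 2 caption (pre-trained skeleton encoder followed by a fully connected layer, with no vision-language prompts) can be sketched as follows. This is an illustrative toy under stated assumptions: `toy_encoder` is a hypothetical stand-in for the real skeleton encoder, and the weights are hand-picked rather than learned.

```python
def toy_encoder(skeleton_sequence):
    # Hypothetical stand-in for a pre-trained skeleton encoder:
    # averages each coordinate over all frames.
    n = len(skeleton_sequence)
    dims = len(skeleton_sequence[0])
    return [sum(frame[d] for frame in skeleton_sequence) / n for d in range(dims)]

def linear_head(features, weights, bias):
    # Fully connected layer: one logit per action class.
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def predict_action(skeleton_sequence, encoder, weights, bias):
    # Only skeleton data is needed at inference time.
    features = encoder(skeleton_sequence)
    logits = linear_head(features, weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)
```

For example, calling `predict_action(sequence, toy_encoder, weights, bias)` returns the index of the highest-scoring action class for one skeleton sequence.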