Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition

Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey

TL;DR

The paper tackles zero-shot skeleton-based action recognition, arguing that relying solely on global features and label semantics limits the transfer of local movement knowledge. It introduces PURLS, which enriches action labels with global and local descriptions via GPT-3 prompting and uses an adaptive partitioning module to gather semantically relevant visual cues from skeleton data, aligned to these descriptions through CLIP-style embeddings. A cross-modal contrastive objective across multiple global and local representations enables transfer to unseen classes, yielding state-of-the-art results on NTU-RGB+D 60/120 and Kinetics-skeleton 200 while remaining compatible with different skeleton backbones. The approach generalizes across skeleton and language backbones, and the code is publicly available for replication and further research.

Abstract

While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset, Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source code is available at https://github.com/azzh1/PURLS.
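
To make the idea of alignment "at both local and global scales" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each skeleton sample and each class has one embedding per description level (for example one global, several body-part, several temporal-interval), that alignment is scored by cosine similarity, and that the training signal is a softmax cross-entropy over class description embeddings at every level. The function names, the temperature value, and the choice to average logits over levels at inference are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def level_logits(visual, text, temperature=0.1):
    """Cosine-similarity logits per description level.

    visual: (B, K, D) skeleton features, one per sample and level.
    text:   (C, K, D) description embeddings, one per class and level.
    Returns an array of shape (K, B, C).
    """
    v = l2_normalize(visual)
    t = l2_normalize(text)
    return np.einsum("bkd,ckd->kbc", v, t) / temperature


def contrastive_loss(visual, text, labels, temperature=0.1):
    """Mean cross-entropy over all levels (assumed form of the objective)."""
    logits = level_logits(visual, text, temperature)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = log_probs[:, np.arange(len(labels)), labels]  # (K, B)
    return float(-picked.mean())


def predict(visual, text, temperature=0.1):
    """Zero-shot prediction: average logits over levels, then take the argmax."""
    return level_logits(visual, text, temperature).mean(axis=0).argmax(axis=-1)
```

At test time, `text` would hold the description embeddings of the unseen classes only, so that `predict` ranks a skeleton sample against classes it was never trained on.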

Paper Structure (17 sections, 6 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Examples of a seen class ('Hit another person with something') and an unseen class ('Shoot at the basket') from NTU-RGB+D 120. While humans can quickly identify their similar hand movements and use this knowledge to distinguish the new class from other unseen classes, label-based global feature learning does not facilitate the direct transfer of such local knowledge.
  • Figure 2: Architecture diagram for PURLS. The matching action label is sent to GPT-3 to obtain detailed descriptions of its global/local body movements, whose textual features are generated by the pre-trained language encoder of CLIP. The visual features of the input skeleton sequence $I$ can be extracted by an arbitrary skeleton backbone $g$ (e.g., Shift-GCN) pre-trained on the seen classes. The output $G$ is then fed to the partitioning module $r$, which adaptively groups the joint-level features into global and spatially/temporally local representations; these are later projected and aligned with their corresponding description embeddings.
  • Figure 3: Spatial partitioning scheme for decomposing body joints into four body parts: (i) Head, (ii) Hands, (iii) Torso, (iv) Legs.
  • Figure 4: Illustration of how the adaptive partitioning module samples local visual representations.
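
The captions for Figures 2 to 4 describe grouping joint-level features $G$ into global, body-part, and temporal-interval representations. The sketch below illustrates that idea under stated assumptions and is not the paper's module: `G` is taken to have shape (frames, joints, channels); the joint indices in `TOY_PARTS` are a hypothetical grouping for an 8-joint toy skeleton, not a real dataset layout; and the adaptive step is approximated by a simple attention pooling in which a description embedding scores every joint feature, with `W_k` standing in for a learned projection.

```python
import numpy as np

# Hypothetical joint grouping for an 8-joint toy skeleton (illustration only).
TOY_PARTS = {
    "head": [0],
    "hands": [1, 2],
    "torso": [3, 4, 5],
    "legs": [6, 7],
}


def static_part_pool(G, parts):
    """Mean-pool G (T, V, C) over time and over the joints of each body part."""
    return {name: G[:, idx, :].mean(axis=(0, 1)) for name, idx in parts.items()}


def temporal_pool(G, num_intervals):
    """Split the time axis into intervals and mean-pool each one -> (num_intervals, C)."""
    chunks = np.array_split(G, num_intervals, axis=0)
    return np.stack([chunk.mean(axis=(0, 1)) for chunk in chunks])


def adaptive_pool(G, query, W_k):
    """Attention pooling driven by a description embedding (assumed mechanism).

    G:     (T, V, C) joint-level features from the skeleton backbone.
    query: (D,) embedding of one global or local description.
    W_k:   (C, D) projection of joint features into the text space.
    Returns the pooled (C,) feature and the (T, V) attention weights.
    """
    T, V, C = G.shape
    tokens = G.reshape(T * V, C)
    scores = (tokens @ W_k) @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()  # numerical stability before exponentiation
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ tokens, weights.reshape(T, V)
```

Unlike the fixed grouping in `static_part_pool`, the weights returned by `adaptive_pool` can spread over any joint and frame, which is the property the adaptive partitioning caption emphasizes: a description about hand movement is free to draw on torso or leg joints when they are relevant.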