Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition
Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey
TL;DR
The paper tackles zero-shot skeleton-based action recognition, arguing that aligning only global skeleton features with label-level semantics limits the transfer of local movement knowledge from seen to unseen classes. It introduces PURLS, which enriches action labels with global and local descriptions via GPT-3 prompting and uses an adaptive partitioning module to gather semantically relevant visual cues from skeleton data, aligned to these descriptions through CLIP-style embeddings. A cross-modal contrastive objective over multiple global and local representations enables transfer to unseen classes, yielding state-of-the-art results on NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-skeleton 200 while remaining compatible with different skeleton backbones. The code is publicly available for replication and further research.
Abstract
While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton and language backbones and on three large-scale datasets: NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset, Kinetics-skeleton 200. The results demonstrate the universality and superior performance of PURLS, which surpasses prior skeleton-based solutions and standard baselines from other domains. The source code is available at https://github.com/azzh1/PURLS.
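The idea of grouping joint features by their relevance to a description, then classifying an unseen action by visual-semantic similarity, can be illustrated with a minimal sketch. This is not the PURLS implementation: the softmax-weighted pooling stands in for the adaptive sampling strategy, the averaging of similarities across description levels is an assumption, and the function names (`attention_pool`, `zero_shot_predict`), the 2-D toy embeddings, and the class labels are all hypothetical. In practice the text embeddings would come from a language encoder and the joint features from a skeleton backbone.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def attention_pool(joint_feats, text_emb, temperature=1.0):
    """Pool joint features with softmax weights given by their similarity
    to one description embedding (an assumed stand-in for adaptive sampling)."""
    scores = [dot(f, text_emb) / temperature for f in joint_feats]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(joint_feats[0])
    return [
        sum(w * f[d] for w, f in zip(weights, joint_feats)) / total
        for d in range(dim)
    ]


def zero_shot_predict(joint_feats, class_descriptions):
    """class_descriptions maps each label to a list of embeddings,
    one per description level (e.g. global, body part, temporal interval).
    The class score is the mean cosine similarity over those levels."""
    scores = {}
    for label, embs in class_descriptions.items():
        sims = [cosine(attention_pool(joint_feats, e), e) for e in embs]
        scores[label] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: three joint features and two unseen classes,
# each described by a global and a local embedding.
joint_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.2]]
class_descriptions = {
    "wave": [[1.0, 0.0], [0.8, 0.2]],
    "kick": [[0.0, 1.0], [0.2, 0.8]],
}

label, scores = zero_shot_predict(joint_feats, class_descriptions)
print(label, scores)
```

Because each description pools its own subset of joints, a local description (say, of arm movement) can match strongly even when the global feature is ambiguous, which is the motivation the abstract gives for aligning at both scales.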
