Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition

Yang Chen; Jingcai Guo; Tian He; Ling Wang

TL;DR

This work proposes a novel method via Side information and dual-prompTs learning for skeleton-based zero-shot Action Recognition (STAR) at the fine-grained level that achieves state-of-the-art performance in ZSL and GZSL settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.

Abstract

Skeleton-based zero-shot action recognition aims to recognize unknown human actions based on the learned priors of known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works bridge the known skeleton representation space and the semantic description space only at the coarse-grained level when recognizing unknown action categories. They ignore the fine-grained alignment of these two spaces, which results in suboptimal performance in distinguishing high-similarity action categories. To address this limitation, we propose a novel method via Side information and dual-prompts learning for skeleton-based zero-shot action recognition (STAR) at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology structure and introduce side information in the form of multi-part descriptions of human body movements, which aligns the skeleton and semantic spaces at the fine-grained level; 2) we design visual-attribute and semantic-part prompts to improve the intra-class compactness within the skeleton space and the inter-class separability within the semantic space, respectively, so that high-similarity actions can be distinguished. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
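
The part-level alignment idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the joint grouping in `PARTS`, the average pooling, and the mean-of-cosine scoring rule are all assumptions made for illustration, and the names `part_features` and `predict_unseen` are hypothetical. The sketch pools per-joint features into one vector per body part, compares each part vector with the embedding of that part's text description for every candidate class, and picks the class with the highest average similarity.

```python
import numpy as np

# Illustrative grouping of 25 joints (0-indexed) into six body parts.
# This grouping is an assumption; the paper's partition may differ.
PARTS = {
    "head": [2, 3],
    "torso": [0, 1, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def part_features(joint_feats, parts):
    """Average-pool per-joint features (J, D) into part features (P, D)."""
    return np.stack([joint_feats[idx].mean(axis=0) for idx in parts.values()])


def predict_unseen(joint_feats, class_part_embeds, parts):
    """Score each candidate class by its mean part-wise cosine similarity.

    joint_feats:       (J, D) skeleton features, one row per joint.
    class_part_embeds: (C, P, D) text embeddings of part descriptions,
                       one set of P part embeddings per candidate class.
    Returns the index of the best class and the (C,) score vector.
    """
    v = l2_normalize(part_features(joint_feats, parts))  # (P, D)
    t = l2_normalize(class_part_embeds)                  # (C, P, D)
    sims = np.einsum("pd,cpd->cp", v, t)                 # cosine per class and part
    scores = sims.mean(axis=1)                           # (C,)
    return int(np.argmax(scores)), scores
```

In a real pipeline the joint features would come from a trained skeleton encoder and the class embeddings from a text encoder applied to the part descriptions; here both are plain arrays so that the alignment step can be read in isolation.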

Paper Structure (19 sections, 9 equations, 5 figures, 7 tables)

Introduction
Related Work
Attention-based Zero-Shot Learning
Skeleton-based Zero-Shot Action Recognition
Method
Problem Definition
Fine-grained Formulation
Dual-Prompt Cross-Modality Alignment
Model Optimization
ZSL/GZSL Prediction
Experiments
Datasets
Evaluation Protocols
Implementation Details
Baseline Settings
...and 4 more sections

Figures (5)

Figure 1: Comparison of methods. (a) Existing skeleton-based zero-shot action recognition methods project the global embedding of skeleton sequences into the semantic space for alignment with category names, neglecting the potential correlation at the fine-grained level; (b) Our STAR decomposes the human skeleton into several regions based on its topology structure and introduces extra side information, namely part motion descriptions, for alignment at the fine-grained level, which substantially improves knowledge transfer from known to unknown categories.
Figure 2: The architecture of the proposed STAR model. In the skeleton stream, we use a GCN backbone to extract skeleton representations and then decompose them into several parts based on topology-based partition strategies. The attention-based mechanism and the visual-attribute prompt are designed to improve the intra-class compactness in the skeleton space by capturing the spatio-temporal characteristics of the actions. In the semantic stream, we generate part descriptions of each action as side information to supply extra fine-grained knowledge. We then propose the semantic-part prompt to improve the inter-class separability of this side information under the constraint of the action category name. Finally, we align the multi-part skeleton representations with the corresponding semantic embeddings under the guidance of several losses.
Figure 3: The influence of hyper-parameters on the NTU RGB+D 60 and the PKU-MMD II datasets.
Figure 4: t-SNE visualizations of the skeleton and semantic spaces for known and unknown categories. Colors denote different known/unknown categories randomly selected from the cross-subject task of the NTU RGB+D 60 dataset under the 55/5 split setting. The first row ((a) - (d)) shows the skeleton space, and the second row ((e) - (h)) shows the semantic space.
Figure 5: Confusion matrices for unknown categories on the cross-subject task of NTU RGB+D 60 under the 55/5 split setting. (a) Confusion matrix of the SMIE method (zhou2023zero). (b) Confusion matrix of STAR (our method).
