An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition

Haojun Xu, Yan Gao, Jie Li, Xinbo Gao

TL;DR

This work tackles zero-shot skeleton-based action recognition by enriching semantic information beyond simple class names. It introduces InfoCPL, a framework that combines a multi-level alignment module (MLA) with a semantic embedding codebook and a selective feature ensemble (SFE) to generate diverse, fine-grained descriptions that better align with visual skeleton features. An attention-inverse mechanism (A_inv) and a flexible loss sampling strategy further stabilize training and promote robust decision surfaces, yielding strong gains on NTU-RGB+D 60/120 and PKU-MMD benchmarks. The results demonstrate improved discrimination of semantically and visually similar actions and establish InfoCPL as a robust approach for generalizing to unseen categories in skeleton-based action recognition.

Abstract

Zero-shot human skeleton-based action recognition aims to build a model that can recognize actions from categories not seen during training. Previous research has focused on aligning the visual and semantic spatial distributions of sequences. However, these methods extract semantic features in a simplistic way, ignoring that well-designed prompts carrying rich, fine-grained action cues can yield robust clustering in the representation space. To alleviate the problem of insufficient information available for skeleton sequences, we design an information compensation learning framework from an information-theoretic perspective, improving zero-shot action recognition accuracy with a multi-granularity semantic interaction mechanism. Inspired by ensemble learning, we propose a multi-level alignment (MLA) approach to compensate information for action classes. MLA aligns multi-granularity embeddings with the visual embedding through a multi-head scoring mechanism to distinguish semantically similar action names and visually similar actions. Furthermore, we introduce a new loss function sampling method to obtain a tight and robust representation. Finally, these multi-granularity semantic embeddings are synthesized to form a proper decision surface for classification. Evaluations on the challenging NTU RGB+D, NTU RGB+D 120, and PKU-MMD benchmarks show significant action recognition performance and validate that multi-granularity semantic features facilitate the differentiation of action clusters with similar visual features.
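The test-time idea in the abstract (scoring a visual embedding against several semantic embeddings per class and combining the scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses a learned multi-head scoring network, whereas the sketch below swaps in plain cosine similarity averaged over the embeddings of each class. The function names `cosine` and `zero_shot_predict`, the class names, and the toy 2-D vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_predict(visual, class_embeddings):
    """Pick the unseen class whose semantic embeddings best match the visual feature.

    class_embeddings maps a class name to a list of embeddings, one per
    description granularity (an assumption made for this illustration).
    Cosine similarity stands in for the learned scoring heads.
    """
    scores = {}
    for name, embeddings in class_embeddings.items():
        # Average the per-embedding scores, mimicking multi-head score averaging.
        scores[name] = sum(cosine(visual, e) for e in embeddings) / len(embeddings)
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: two unseen classes, two description embeddings each.
visual_feature = [1.0, 0.0]
unseen_classes = {
    "drink water": [[1.0, 0.0], [0.8, 0.6]],
    "eat meal": [[0.0, 1.0], [0.6, 0.8]],
}
predicted, class_scores = zero_shot_predict(visual_feature, unseen_classes)
print(predicted, class_scores)
```

Averaging over several description embeddings is the simplest possible ensemble; it only conveys why multiple descriptions per class can separate classes whose single class-name embeddings lie close together.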

Paper Structure (31 sections, 5 equations, 8 figures, 6 tables)

This paper contains 31 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Concept map of zero-shot learning and our proposed method. During training, GPT generates diverse descriptions for each seen class, a pre-trained language encoder extracts their features, and these description features are aligned with the visual features at multiple granularities. At test time, the trained alignment network computes the similarity between the input samples and the unseen-class description features to obtain the classification results.
  • Figure 2: t-SNE visualization of the semantic features extracted from all categories in NTU 60. Subfigure (a) shows cases where the class texts are similar and their semantic features lie close together, which can cause confusion during testing. Subfigure (b) visualizes the semantic features aggregated by the SFE modules across multiple branches for the classes in the red dashed box of subfigure (a). Circles and triangles indicate the text features of two classes with similar class names; the features in red and blue are the multiple description features of the two classes, and the features in gold are the top-$k$ features selected by SFE for them. For these two similar classes, the margin between the features aggregated by our SFE module is larger than that between the original class description features, corresponding to stronger separability.
  • Figure 3: The pipeline of our framework. (a) First, the positive and negative samples are fed to a pre-trained ST-GCN to extract the visual features. Then, multi-granularity scores are obtained by averaging the multi-head scores of our proposed MLA module, which consists of the Attention Inversion ($A_{inv}$) strategy and the Multiple Semantic Feature Ensemble (SFE) module. Here, the $A_{inv}$ strategy performs weight inversion at the early stage of training to obtain an open semantic space for the semantic codebook composed of differentiated descriptive features. The SFE selects from the rich descriptive embeddings generated for each class to produce embeddings of different granularities for alignment. Finally, MLA's multi-head scoring network synthesizes the multi-granularity embeddings in the semantic space to form decision surfaces with a larger inter-class margin for classification. (b) Notation summary: the visual features aligned with the high-dimensional semantic features are $\boldsymbol{u_{hd}}$ and $\boldsymbol{{u_{hd}}'}$, the positive and negative examples, respectively. The corresponding similarity scores output by the MLA are denoted $s$ and $s'$ and are used to compute the loss function. (c) A unique advantage of the SFE module. As shown in the upper right corner of the figure, SFE can synthesize multiple features with the same semantic meaning into anchor points based on visual-semantic cross-attention, using the enriched semantic codebook to achieve better alignment.
  • Figure 4: Comparison of different values of $k$ for top-$k$ attention in MLA on the NTU-60 dataset. $k$ is varied in steps of 5; the maximum number of branches is 20.
  • Figure 5: Intuitive visualization of $A_{inv}$ improvement on the single branch.
  • ...and 3 more figures
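
The top-$k$ selection described in the captions of Figures 2 to 4 (SFE picks a subset of a class's description embeddings and synthesizes them into an anchor using visual-semantic attention) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's SFE module: attention is reduced to a dot product followed by a softmax over the selected items, and the names `softmax`, `topk_ensemble`, and the toy vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def topk_ensemble(visual, descriptions, k):
    """Select the k description embeddings most relevant to a visual feature
    and fuse them into one anchor embedding.

    Relevance is a plain dot product here (an assumption standing in for
    visual-semantic cross-attention); the anchor is the softmax-weighted
    mean of the selected embeddings.
    """
    scores = [sum(v * d for v, d in zip(visual, desc)) for desc in descriptions]
    # Indices of the k highest-scoring descriptions, best first.
    top = sorted(range(len(descriptions)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    anchor = [
        sum(w * descriptions[i][j] for w, i in zip(weights, top))
        for j in range(len(visual))
    ]
    return top, anchor


# Toy example: three description embeddings for one class, keep the best two.
visual_feature = [1.0, 0.0]
description_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
selected, anchor = topk_ensemble(visual_feature, description_embeddings, k=2)
print(selected, anchor)
```

In this toy setting the second description is discarded because it points away from the visual feature, and the anchor lies between the two retained embeddings. Varying `k` corresponds loosely to the sweep reported in Figure 4.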