SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Sheng-Wei Li; Zi-Xiang Wei; Wei-Jie Chen; Yi-Hsin Yu; Chih-Yuan Yang; Jane Yung-jen Hsu

TL;DR

SA-DVAE tackles zero-shot skeleton-based action recognition by disentangling the skeleton latent space into semantic-related $z^r_x$ and semantic-irrelevant $z^v_x$ components and aligning only the semantic-related part with text latent $z_y$ via two modality-specific VAEs. An adversarial discriminator enforces independence between $z^r_x$ and $z^v_x$ through a total correlation penalty, while a cross-alignment loss ties cross-modal reconstructions to the corresponding modalities. Experiments on NTU-60, NTU-120, and PKU-MMD demonstrate state-of-the-art ZSL and GZSL performance, with substantial gains on unseen classes and balanced harmonic means; ablations show that feature disentanglement drives most gains and the TC penalty strengthens generalization. By addressing the asymmetry between skeleton and text modalities, the approach yields robust cross-modal representations for zero-shot action recognition and can be further enhanced with pose canonicalization and LLM-augmented descriptions.
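The harmonic mean mentioned above is, in the usual GZSL convention, the harmonic mean of seen-class and unseen-class accuracy; it rewards models that do well on both rather than on one at the expense of the other. The following is a minimal sketch under that assumption. The function name `harmonic_mean` and the sample accuracies are illustrative and are not taken from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen and unseen accuracy (conventional GZSL metric).

    Assumption: both accuracies are on the same scale (e.g. fractions in [0, 1]).
    """
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Illustrative values only (not results from the paper).
balanced = harmonic_mean(0.5, 0.5)    # equal accuracies give the same value back
collapsed = harmonic_mean(0.8, 0.0)   # ignoring unseen classes drives the score to zero
```

Because the score collapses when either accuracy is near zero, a model cannot obtain a high harmonic mean simply by predicting seen classes well.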

Abstract

Existing zero-shot skeleton-based action recognition methods utilize projection networks to learn a shared latent space of skeleton features and semantic embeddings. The inherent imbalance in action recognition datasets, characterized by variable skeleton sequences yet constant class labels, presents significant challenges for alignment. To address the imbalance, we propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders, a method that first adopts feature disentanglement to separate skeleton features into two independent parts -- one semantic-related and the other semantic-irrelevant -- to better align skeleton and semantic features. We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets: NTU RGB+D, NTU RGB+D 120 and PKU-MMD, and our experimental results show that SA-DVAE produces improved performance over existing methods. The code is available at https://github.com/pha123661/SA-DVAE.
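To make the disentanglement idea concrete, the sketch below shows one common way a total correlation penalty between two latent parts can be estimated with a discriminator: permute one part across the batch to approximate samples from the product of marginals, then use the discriminator's output in a density-ratio estimate. This is an assumption about the general technique (in the style of FactorVAE), not a description of the authors' implementation; the names `split_latent`, `shuffle_pairs`, and `tc_estimate` are hypothetical.

```python
import math
import random


def split_latent(z, semantic_dim):
    """Split one skeleton latent vector into (z_r, z_v).

    Assumption: the first `semantic_dim` entries are the semantic-related part.
    """
    return z[:semantic_dim], z[semantic_dim:]


def shuffle_pairs(batch_r, batch_v, rng):
    """Break the pairing between z_r and z_v by permuting z_v across the batch.

    The shuffled pairs approximate samples from q(z_r) * q(z_v), which a
    discriminator can contrast with the original (joint) pairs.
    """
    perm = list(range(len(batch_v)))
    rng.shuffle(perm)
    return [(r, batch_v[j]) for r, j in zip(batch_r, perm)]


def tc_estimate(joint_probs, eps=1e-7):
    """Density-ratio estimate of total correlation.

    `joint_probs` are discriminator outputs D(z_r, z_v) on true pairs, read as
    the probability that a pair comes from the joint distribution. The estimate
    is the batch mean of log(D / (1 - D)).
    """
    total = 0.0
    for p in joint_probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the log finite
        total += math.log(p / (1.0 - p))
    return total / len(joint_probs)


# Usage with toy values (not data from the paper).
rng = random.Random(0)
z_r, z_v = split_latent([1, 2, 3, 4, 5], semantic_dim=3)
shuffled = shuffle_pairs([[1], [2], [3]], [[10], [20], [30]], rng)
undecided = tc_estimate([0.5, 0.5])  # discriminator cannot tell joint from shuffled
```

When the discriminator outputs 0.5 for every true pair, the estimate is zero, which corresponds to the two parts looking independent to the discriminator; adding this estimate to the VAE loss pushes the encoder toward that state.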

Paper Structure (11 sections, 13 equations, 6 figures, 16 tables)


Introduction
Related Work
Skeleton-Based Zero-Shot Action Recognition.
Methodology
Experiments
Conclusion
Hyperparameter Search Space and Sensitivity
Feature Extractors
Combining with Existing Methods
Pose Canonicalization on Skeleton Data
Enhanced Class Descriptions by a Large Language Model (LLM)

Figures (6)

Figure 1: Comparison with existing methods. Our method is the first to apply feature disentanglement to the problem of skeleton-based zero-shot action recognition. All existing methods directly align skeleton features with textual ones, but ours only aligns a part of semantic-related skeleton features with the textual ones.
Figure 2: System Architecture of SA-DVAE. Initially, the feature extractors are employed to extract features. Subsequently, the cross-modal alignment module aligns the two modalities and generates semantic-related unseen skeleton features ($z^r_x$). These generated features are utilized to train classifiers.
Figure 3: Cross-Modal Alignment Module. This module serves two primary tasks: latent space construction through self-reconstruction and cross-modal alignment via cross-reconstruction. The skeleton features are disentangled into semantic-related ($z^r_x$) and irrelevant ($z^v_x$) factors.
Figure 4: t-SNE visualizations of $z^r_x$ and $z^v_x$. Best viewed in color.
Figure 5: Unseen per-class accuracy of the NTU-60 dataset. The unseen split {1, 9, 16, 29, 47} is used in a challenging run of our random-split GZSL experiments.
...and 1 more figure
