Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo

TL;DR

Zero-shot skeleton action recognition suffers from fragile semantic anchors and rigid decision boundaries when aligning skeletons to semantics. Flora addresses this by learning neighbor-aware semantic representations with a geometric-consistency cross-modal VAE alignment, and by deciding with a noise-free, open-form flow classifier enhanced by contrastive regularization, enabling token-level, distribution-aware discrimination. The approach achieves state-of-the-art or competitive results across NTU-60, NTU-120, and PKU-MMD, including strong performance in low-shot settings, and demonstrates robustness to cross-view and cross-setup conditions. By separating learning (neighbor-aware semantics and robust alignment) from flexible deciding (flow-based transport without noise/conditioning), Flora offers a generalizable framework for open-world zero-shot skeleton understanding and future extensions such as skeleton-specific semantics and improved low-shot learning.

Abstract

Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed Flora, which builds upon FlexibLe neighbOr-aware semantic attunement and open-form distRibution-aware flow clAssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
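
To make the phrase "noise-free flow matching" concrete, the following is a minimal sketch of the generic flow-matching idea with a straight-line path whose source is a semantic latent rather than Gaussian noise. It is not taken from the Flora code base: the function names, the linear interpolant, and the mean-squared-error objective are illustrative assumptions, and the paper's actual path, loss, and token-level formulation may differ.

```python
# Hypothetical sketch: straight-path flow matching from a semantic latent
# (source, t = 0) to a skeleton latent (target, t = 1). No noise is sampled.
# All names here are illustrative assumptions, not the authors' API.

def interpolate(z_sem, z_skel, t):
    """Point on the straight path between the two latents at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_sem, z_skel)]


def target_velocity(z_sem, z_skel):
    """Velocity of the straight path; it is constant in t: z_skel - z_sem."""
    return [b - a for a, b in zip(z_sem, z_skel)]


def flow_matching_loss(predict_velocity, z_sem, z_skel, t):
    """Mean squared error between predicted and target velocity at time t.

    `predict_velocity(z_t, t)` stands in for a learned network; it receives
    only the interpolated point and the timestep (no class condition).
    """
    z_t = interpolate(z_sem, z_skel, t)
    v_pred = predict_velocity(z_t, t)
    v_tgt = target_velocity(z_sem, z_skel)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_tgt)) / len(v_tgt)


if __name__ == "__main__":
    z_sem = [0.0, 2.0]    # toy semantic latent
    z_skel = [1.0, 4.0]   # toy skeleton latent

    # A predictor that always outputs zero velocity, used only as a baseline.
    def zero_field(z_t, t):
        return [0.0 for _ in z_t]

    print(interpolate(z_sem, z_skel, 0.5))
    print(flow_matching_loss(zero_field, z_sem, z_skel, 0.5))
```

In this toy setting, a trained velocity predictor would replace `zero_field`, and the loss would be averaged over sampled timesteps and latent pairs.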

Paper Structure

This paper contains 23 sections, 15 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Overview of our Flora versus previous methods.
  • Figure 2: The pipeline of our method, including the learning and deciding phases (zoom in for a better view).
  • Figure 3: Performance comparison on NTU-60 and NTU-120 with different timestep selection $t$ in the inference phase.
  • Figure 4: Neighbor selection analysis with corresponding semantic similarity scores on NTU-60 and NTU-120.
  • Figure 5: Flow velocity visualization in the deciding phase on NTU-60 (55/5 Split). Each pair shows distribution transport from the semantic (left) to the skeleton (right) space, with red and blue arrows denoting target and predicted velocities (zoom in for a better view).
  • ...and 7 more figures