Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition

Yang Chen, Jingcai Guo, Song Guo, Dacheng Tao

TL;DR

This work proposes Neuron, a dyNamically Evolving dUal skeleton-semantic syneRgistic framework guided by cOntext-aware side informatioN, which explores fine-grained cross-modal correspondence from micro to macro perspectives at both the spatial and temporal levels.

Abstract

Zero-shot skeleton action recognition is a non-trivial task that requires robust generalization to unseen classes using prior knowledge from only seen classes and shared semantics. Existing methods typically build skeleton-semantic interactions through uncontrollable mappings and conspicuous representations, and thus can hardly capture the intricate, fine-grained relationships needed for effective cross-modal transfer. To address these issues, we propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework with the guidance of cOntext-aware side informatioN (dubbed Neuron), which explores fine-grained cross-modal correspondence from micro to macro perspectives at both the spatial and temporal levels. Concretely, 1) we first construct spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information to capture the intricate and synergistic skeleton-semantic correlations step by step, progressively refining cross-modal alignment; and 2) we introduce spatial compression and temporal memory mechanisms to guide the growth of the spatial-temporal micro-prototypes, enabling them to absorb structure-related spatial representations and regularity-dependent temporal patterns. Notably, these processes are analogous to the learning and growth of neurons, equipping the framework with the capacity to generalize to novel unseen action categories. Extensive experiments on various benchmark datasets demonstrate the superiority of the proposed method.
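As a rough illustration of the zero-shot setting described in the abstract, the sketch below assigns a skeleton feature to the unseen class whose semantic embedding is most similar to it. This is a generic nearest-embedding baseline, not the Neuron method: the cosine-similarity rule, the function names, and the toy vectors are all assumptions made for illustration, and the sketch presumes the skeleton feature has already been projected into the same space as the class embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(skeleton_feature, class_embeddings):
    """Return the class name whose embedding is most similar to the feature.

    Hypothetical helper: assumes the feature and the embeddings already
    live in one shared space (the alignment step itself is not shown).
    """
    return max(
        class_embeddings,
        key=lambda name: cosine(skeleton_feature, class_embeddings[name]),
    )


# Toy, hand-made vectors (hypothetical; not taken from the paper).
unseen_class_embeddings = {
    "wave": [0.9, 0.1, 0.0],
    "kick": [0.0, 0.2, 0.9],
}
projected_skeleton = [0.8, 0.3, 0.1]

print(zero_shot_predict(projected_skeleton, unseen_class_embeddings))  # prints "wave"
```

No unseen-class skeleton data is used to build the class embeddings here; only their semantic descriptions are needed at inference time, which is what makes the setting zero-shot.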

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Method sketches: (a) Existing methods employ homogeneous semantics (manually designed or from one-turn LLMs) to align the skeleton space in a single, static step; (b) Our Neuron introduces context-aware side information (from multi-turn LLMs) and evolving micro-prototypes to capture cross-modal correspondence from micro to macro perspectives (dynamic) for controllable alignment.
  • Figure 2: The pipeline of the proposed method. (a) The evolving spatial-temporal representation learning process. (b) The stepwise skeleton-semantic alignment.
  • Figure 3: The influence of hyper-parameters on NTU 60.
  • Figure 4: (a) - (c) t-SNE visualizations of the skeleton spatial space for unseen categories in different phases. Colors denote different unseen categories from the cross-subject task of the NTU 60 dataset under the 55/5 split setting. (d) The spatial intra-class compactness metrics of the corresponding phases (zoom in for a better view).
  • Figure 5: (a) - (c) Visualizations of the updated temporal micro-prototype in each phase for a randomly selected skeleton sequence. The first row shows the update process without the proposed temporal memory mechanism, while the second row shows it with the mechanism. (d) The temporal intra-class compactness metrics of the corresponding phases (zoom in for a better view).
  • ...and 1 more figure