One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching

Siyuan Yang, Jun Liu, Shijian Lu, Er Meng Hwa, Alex C. Kot

TL;DR

A one-shot skeleton action recognition technique based on multi-scale spatial-temporal feature matching, which consistently outperforms the state of the art (SOTA) by large margins.

Abstract

One-shot skeleton action recognition, which aims to learn a skeleton action recognition model with a single training sample, has attracted increasing interest because large-scale skeleton action data is difficult to collect and annotate. However, most existing studies match skeleton sequences by comparing their feature vectors directly, which neglects the spatial structure and temporal order of skeleton data. This paper presents a one-shot skeleton action recognition technique based on multi-scale spatial-temporal feature matching. We represent skeleton data at multiple spatial and temporal scales and achieve optimal feature matching from two perspectives. The first is multi-scale matching, which captures the scale-wise semantic relevance of skeleton data at multiple spatial and temporal scales simultaneously. The second is cross-scale matching, which handles different motion magnitudes and speeds by capturing sample-wise relevance across multiple scales. Extensive experiments on three large-scale datasets (NTU RGB+D, NTU RGB+D 120, and PKU-MMD) show that our method consistently outperforms the state of the art by large margins.
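
The contrast the abstract draws between comparing feature vectors directly and matching feature sets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the function names are hypothetical, cosine similarity is used as the relevance measure, both feature sets get uniform weights, and the matching flow is approximated with Sinkhorn iterations (an entropy-regularized optimal transport solver swapped in here because it is short to write).

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sinkhorn_flow(cost, reg=0.1, n_iters=200):
    """Approximate transport plan between two uniformly weighted sets.

    Assumption: Sinkhorn iterations stand in for whatever solver the
    paper uses; `reg` and `n_iters` are illustrative values.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def matching_score(x_feats, y_feats):
    """Relevance score: similarities weighted by the matching flow."""
    sim = [[cosine(x, y) for y in y_feats] for x in x_feats]
    cost = [[1.0 - s for s in row] for row in sim]
    flow = sinkhorn_flow(cost)
    return sum(
        f * s
        for flow_row, sim_row in zip(flow, sim)
        for f, s in zip(flow_row, sim_row)
    )


# Toy feature sets: y holds the same local features as x in another order,
# while z repeats a single feature.
x = [[1.0, 0.0], [0.0, 1.0]]
y = [[0.0, 1.0], [1.0, 0.0]]
z = [[1.0, 0.0], [1.0, 0.0]]
score_xy = matching_score(x, y)
score_xz = matching_score(x, z)
```

In this toy setting the flow pairs each feature in `x` with its counterpart in `y`, so `score_xy` is close to 1, while `score_xz` is about 0.5 because half of `x` has no good partner in `z`.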

Paper Structure (23 sections, 9 equations, 9 figures, 13 tables)

This paper contains 23 sections, 9 equations, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: Skeleton action recognition based on feature similarity or feature matching: Feature similarity computes the distance between feature vectors, which discards the useful spatial skeleton structure and temporal information. The proposed feature matching compares two skeleton sequences by computing a matching flow between their feature distributions, which captures spatial and temporal information effectively. The colored lines highlight channels that are paired because of their high matching scores.
  • Figure 2: The proposed multi-scale skeleton modeling along the spatial dimension in (a) and the temporal dimension in (b): Given the original spatial scale at Scale 1, we first divide the skeleton nodes into multiple groups with similar semantic skeleton structures and then apply average pooling to each group to generate skeleton graphs at coarser scales (i.e., Scale 2 and Scale 3). Nodes whose links share the same color belong to the same group with similar semantics. Along the temporal dimension, we apply average pooling over the features of adjacent frames to obtain temporal features at the coarser Scale 2 and Scale 3. (Details of spatial pooling are available in the appendix.)
  • Figure 3: The pipeline of the proposed method: Given a query sequence (Query Sequence) and two support skeleton sequences (Support Sequence 1 and Support Sequence 2), skeleton representations are first extracted with a weight-sharing embedding network and then aligned progressively with the proposed Optimal Matching. The semantic relevance score (denoted by $s(\cdot, \cdot)$) between the query and each of the two support instances can then be computed for action prediction. The pipeline is illustrated with a 2-way 1-shot task. Different Optimal Matching strategies are shown in Fig. 4.
  • Figure 4: Illustration of optimal matching for Spatial Matching in (a) and Temporal Matching in (b): In each of the two sub-figures, the three matching strategies along the diagonal (in blue) illustrate the proposed multi-scale matching, and the remaining six off the diagonal (in orange) show the proposed cross-scale matching. Here, $s(\cdot, \cdot)$ represents the semantic relevance score between two skeleton features. X and Y stand for two skeleton sequences.
  • Figure 5: Illustration of the single-scale embedding network (AGCN; Shi et al., CVPR 2019). There are a total of 9 AGC blocks, followed by a global average layer and a softmax classifier. (N denotes the number of skeleton joints, T denotes the number of frames, and $C^{(i)}$ denotes the number of output channels at the $i^{\text{th}}$ block.)
  • ...and 4 more figures
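
The multi-scale modeling in Figure 2 and the scale pairings in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the pooling window of 2, and the joint grouping are hypothetical (the paper's actual grouping is in its appendix, which is not reproduced here), and features are plain lists of floats rather than network outputs.

```python
from itertools import product


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def temporal_pool(frames, window=2):
    """Average each run of `window` adjacent frame features.

    Assumption: a trailing run shorter than `window` is averaged as-is.
    """
    return [
        mean_vector(frames[start:start + window])
        for start in range(0, len(frames), window)
    ]


def spatial_pool(joints, groups):
    """Average joint features within each group of joint indices.

    `groups` is a hypothetical grouping, e.g. joints of the same limb.
    """
    return [mean_vector([joints[j] for j in group]) for group in groups]


def scale_pairs(num_scales=3):
    """Split all (scale of X, scale of Y) pairs into same-scale and cross-scale."""
    pairs = list(product(range(1, num_scales + 1), repeat=2))
    multi_scale = [(i, j) for i, j in pairs if i == j]
    cross_scale = [(i, j) for i, j in pairs if i != j]
    return multi_scale, cross_scale


# Temporal scales: each coarser scale halves the number of frames.
frames_s1 = [[0.0], [2.0], [4.0], [6.0]]
frames_s2 = temporal_pool(frames_s1)
frames_s3 = temporal_pool(frames_s2)

# Spatial scales: three joints merged into two groups (illustrative grouping).
joints_s1 = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]]
joints_s2 = spatial_pool(joints_s1, groups=[[0, 1], [2]])

multi_scale, cross_scale = scale_pairs(3)
```

With three scales, `scale_pairs` yields the three diagonal pairs and six off-diagonal pairs described in the Figure 4 caption; a relevance score $s(\cdot, \cdot)$ would then be computed for each pair.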