Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du

TL;DR

This work tackles skeleton-based action recognition under limited annotations by reframing semi-supervised 3D Action Recognition via Active Learning (S3ARAL) as a Markov Decision Process. It introduces an Informative Sample Selection Model (ISSM) trained with Double DQN, where state representations are projected to hyperbolic space to better capture hierarchical skeleton structure, and rewards reflect the action recognizer's performance gains. A meta-tuning strategy based on meta-learning accelerates deployment when expanding labeled data. Across three benchmarks (UWA3D, NW-UCLA, NTU RGB+D 60), the method achieves state-of-the-art accuracy over varying labeling budgets and demonstrates strong generalization, with ablations confirming the value of hyperbolic representations, MMD-based state gaps, and meta-tuning.
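
To make the Double DQN training signal concrete, here is a minimal sketch of how a single target value could be computed. It is not the authors' implementation. It assumes that each candidate unlabeled sample corresponds to one action and that the reward is the recognizer's accuracy gain; the names `double_dqn_target`, `next_q_online`, `next_q_target`, and `gamma` are illustrative and do not come from the paper.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Compute one Double DQN target value.

    The online network chooses the next action (argmax over its Q-values),
    while the target network supplies the value of that chosen action.
    Both Q-value lists are indexed by candidate sample (an assumption).
    """
    if done or not next_q_online:
        # No further selection step: the target is just the reward.
        return reward
    # Action selection uses the online network's estimates.
    best = max(range(len(next_q_online)), key=lambda i: next_q_online[i])
    # Action evaluation uses the target network's estimate.
    return reward + gamma * next_q_target[best]


if __name__ == "__main__":
    # Hypothetical numbers: accuracy gain of 0.04 and three remaining candidates.
    print(double_dqn_target(0.04, [0.2, 0.9, 0.1], [0.5, 0.3, 0.8], gamma=0.9))
```

Separating action selection from action evaluation is what distinguishes Double DQN from plain DQN, which takes the maximum of the target network's values directly and tends to overestimate.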

Abstract

Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is used to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences are not necessarily the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To address this, we reformulate semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Building on the MDP framework and its training paradigm, we train an informative sample selection model to guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta-tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.

Paper Structure

This paper contains 18 sections, 9 equations, 2 figures, 6 tables, 1 algorithm.

Figures (2)

  • Figure 1: Motivation of our method. Previous semi-supervised 3D action recognition via active learning (S3ARAL) approaches (li2023sar) rely on a margin-based selection strategy that aims to identify representative samples. However, such a strategy may select samples that carry little novel information, leading to sub-optimal performance of the trained action recognizer. In contrast, we propose a novel perspective by reformulating S3ARAL as a Markov Decision Process (MDP) (mnih2015human). Within this framework, we train an informative sample selection model to choose training samples that are more likely to improve the action recognizer’s performance, thereby enabling the training of a more effective model.
  • Figure 2: The overall pipeline of our method. In the task of semi-supervised 3D action recognition via active learning, given an unlabeled dataset and an annotation budget, we divide the dataset into a labeled set, an unlabeled set, and a reward set. We then design an Informative Sample Selection Model (ISSM) to select informative samples for annotation. The annotated samples are used to train the action recognizer. To ensure that the ISSM receives sufficient information for effective sample selection, we carefully construct the state-action pairs. The state encodes the distribution gap between the labeled and unlabeled sets, along with the budget consumption ratio. The action captures both the sample’s potential contribution to improving the action recognizer and its representativeness. To enhance the expressiveness of the state-action representations for the skeleton-based action recognition task, we project them from Euclidean space to hyperbolic space. This is motivated by the exponential volume growth of hyperbolic space, which makes it well suited to modeling the hierarchical structure of human skeletons. The performance improvement of the action recognizer between consecutive iterations is treated as the reward for training the ISSM.
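
The state and reward described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation. It assumes the distribution gap is a biased squared-MMD estimate with an RBF kernel, that the hyperbolic model is a Poincaré ball with curvature -c, and that the projection is the exponential map at the origin. All function names (`mmd2`, `expmap0`, `build_state`, `reward`) and the parameters `sigma` and `c` are hypothetical.

```python
import math


def rbf(x, y, sigma=1.0):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * a for a in v]


def build_state(labeled, unlabeled, used, budget, c=1.0):
    """State = [labeled/unlabeled distribution gap, budget consumption ratio],
    projected from Euclidean space into the Poincare ball."""
    gap = mmd2(labeled, unlabeled)
    ratio = used / budget
    return expmap0([gap, ratio], c)


def reward(acc_now, acc_prev):
    """Reward = recognizer accuracy gain between consecutive iterations."""
    return acc_now - acc_prev


if __name__ == "__main__":
    labeled = [[0.0, 0.1], [0.2, 0.0]]    # toy latent features
    unlabeled = [[1.0, 1.1], [0.9, 1.2]]
    print(build_state(labeled, unlabeled, used=20, budget=100))
    print(reward(0.62, 0.58))
```

In this sketch the state has only two components. An actual state or action representation would typically carry higher-dimensional features, but the same exponential map applies to each vector.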