Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition

Lilang Lin, Lehong Wu, Jiahang Zhang, Jiaying Liu

TL;DR

This paper proposes a skeleton-based idempotent generative model (IGM) for unsupervised representation learning. An idempotency constraint provides stronger consistency regularization in the feature space, pushing the features to retain only the motion semantics that are critical for the recognition task.

Abstract

Generative models, a powerful technique for generation, are also gradually becoming a critical tool for recognition tasks. However, in skeleton-based action recognition, the features obtained from existing pre-trained generative methods contain redundant information unrelated to recognition. This is at odds with the spatially sparse and temporally consistent nature of skeleton data and leads to undesirable performance. To address this challenge, we bridge the gap in both theory and methodology and propose a novel skeleton-based idempotent generative model (IGM) for unsupervised representation learning. More specifically, we first theoretically demonstrate the equivalence between generative models and maximum entropy coding, which suggests a potential route to making the features of generative models more compact: introducing contrastive learning. To this end, we introduce an idempotency constraint that forms a stronger consistency regularization in the feature space, pushing the features to retain only the motion semantics that are critical for the recognition task. Extensive experiments on the benchmark datasets NTU RGB+D and PKUMMD demonstrate the effectiveness of the proposed method. On the NTU 60 xsub benchmark, we observe a performance improvement from 84.6% to 86.2%. Furthermore, in zero-shot adaptation scenarios, our model achieves promising results in cases that were previously unrecognizable. Our project is available at https://github.com/LanglandsLin/IGM.
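
The abstract does not spell out the exact form of the idempotency constraint. As a rough illustration only, the sketch below assumes it asks that re-encoding a generated sample reproduces the features it was generated from, i.e. $f(g(f(x))) \approx f(x)$. The linear encoder, the pseudo-inverse generator, and all names and dimensions are hypothetical stand-ins, not the IGM architecture.

```python
import numpy as np


def idempotency_loss(encode, generate, x):
    """Mean squared gap between f(x) and f(g(f(x))) (assumed form of the constraint)."""
    z = encode(x)                  # features of the real samples
    z_again = encode(generate(z))  # features of samples regenerated from z
    return float(np.mean((z - z_again) ** 2))


rng = np.random.default_rng(0)
d, k = 12, 4                       # hypothetical input and feature dimensions
W = rng.standard_normal((k, d))    # toy linear encoder weights
W_pinv = np.linalg.pinv(W)         # toy generator: pseudo-inverse of the encoder


def encode(x):
    return x @ W.T


def good_generate(z):
    # Maps features back to inputs so that re-encoding recovers z.
    return z @ W_pinv.T


def noisy_generate(z):
    # Same generator plus noise that leaks into the re-encoded features.
    return z @ W_pinv.T + 0.5 * rng.standard_normal((z.shape[0], d))


x = rng.standard_normal((8, d))    # a batch of 8 toy inputs
loss_good = idempotency_loss(encode, good_generate, x)
loss_noisy = idempotency_loss(encode, noisy_generate, x)
print(loss_good, loss_noisy)
```

In this toy setting the consistent generator gives a loss that is zero up to floating-point error, while the noisy generator gives a strictly positive loss, which is the kind of gap a consistency regularizer of this form would penalize.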

Paper Structure

This paper contains 15 sections, 1 theorem, 22 equations, 4 figures, 6 tables.

Key Result

Theorem 1

If $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_m$ are the eigenvalues of $\mathbf{A}$, and if the clustering purity is $1 - \alpha$, we obtain: where $c_1, c_2$ are some constants.

Figures (4)

  • Figure 1: We first apply data augmentations to the input and then obtain the conditional features through the encoder $f(\cdot)$. The noisy skeleton is then obtained using diffusion sampling. The noisy skeleton and the conditions are fed into the generator $g(\cdot)$ for denoising. The adapter $h(\cdot)$ projects and fuses the features extracted by the encoder $f(\cdot)$ into the generator's feature space for use as conditions. In the adapter, (a) computes the similarity between spatio-temporal tokens within the sequence, (b) computes similar tokens based on the similarity of each token, and (c) performs de-correlation by subtracting the similar tokens. This expands the effective dimension of the feature space, yielding a more robust and comprehensive representation. We use two losses in our model: the diffusion noise prediction loss and the idempotent feature constraint, which respectively constrain feature similarity and distributional similarity. Feature consistency is thereby improved, which improves both recognition performance and the perceptual reconstruction quality of the generative model.
  • Figure 2: Curve of singular values with the singular value index.
  • Figure 3: Visualisation of features in ground truth data and generated data.
  • Figure 4: Visualisations of ground truth data and generated data. Above is the ground truth data, and below is the generated data. The conditions provided by the encoder are incorporated with data transformation, resulting in generated data that maintain similar motion information while exhibiting some diversity.
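
The three adapter steps (a) to (c) in the Figure 1 caption can be sketched as follows. The caption does not say how similarity is measured or how similar tokens are aggregated, so the cosine similarity, the softmax weighting with a `temperature` parameter, and the exclusion of each token from its own neighbourhood are all assumptions made for illustration; `decorrelate_tokens` is a hypothetical name.

```python
import numpy as np


def decorrelate_tokens(tokens, temperature=0.1):
    """Subtract from each token a similarity-weighted average of the other tokens."""
    # (a) Pairwise cosine similarity between tokens (assumed similarity measure).
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # assumption: a token is not its own neighbour

    # (b) "Similar token" for each token: softmax-weighted mean of the others.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    similar = weights @ tokens

    # (c) De-correlation: remove the shared component.
    return tokens - similar


rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 5))  # 6 spatio-temporal tokens, 5 channels (toy sizes)
out = decorrelate_tokens(tokens)
print(out.shape)
```

Under these assumptions, a sequence whose tokens are all identical is mapped to zeros, since every token is fully explained by its neighbours; only the information that distinguishes a token from similar ones survives the subtraction.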

Theorems & Definitions (1)

  • Theorem 1