USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation

Wanjiang Weng, Hongsong Wang, Junbo Wang, Lei He, Guosen Xie

TL;DR

The paper addresses the inefficiencies of negative-based self-supervised methods in skeleton-based representation learning, particularly for dense prediction tasks. It introduces USDRL, a negative-sample-free framework that applies feature decorrelation across the temporal, spatial, and instance domains, coupled with a Dense Spatio-Temporal Encoder (DSTE) that captures fine-grained spatio-temporal patterns. A Multi-Grained Feature Decorrelation loss combines intra-sample consistency with inter-sample separability, while the encoder uses Dense Shift Attention and Convolutional Attention to produce robust dense representations. Experiments on NTU-60, NTU-120, and PKU-MMD I/II show state-of-the-art performance in action recognition, retrieval, and detection. The code is publicly released.

Abstract

Contrastive learning has recently achieved great success in skeleton-based representation learning. However, the prevailing methods are predominantly negative-based, requiring an additional momentum encoder and a memory bank to obtain negative samples, which makes model training more difficult. Furthermore, these methods concentrate primarily on learning a global representation for recognition and retrieval tasks, while overlooking the rich and detailed local representations that are crucial for dense prediction tasks. To alleviate these issues, we introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation, called USDRL. It applies feature decorrelation across the temporal, spatial, and instance domains in a multi-grained manner, reducing redundancy among the dimensions of the representations so as to maximize the information extracted from features. Additionally, we design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action representations effectively, thereby enhancing the performance of dense prediction tasks. Comprehensive experiments on the NTU-60, NTU-120, PKU-MMD I, and PKU-MMD II benchmarks, across diverse downstream tasks including action recognition, action retrieval, and action detection, demonstrate that our approach significantly outperforms the current state-of-the-art (SOTA) approaches. Our code and models are available at https://github.com/wengwanjiang/USDRL.
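
To make the idea of feature decorrelation concrete, the following is a minimal sketch of a generic redundancy-reduction loss in the style of Barlow Twins. It is not taken from the USDRL code, and the abstract does not spell out the exact loss, so the function names, the `off_diag_weight` value, and the use of a cross-correlation matrix are all illustrative assumptions. The sketch compares embeddings of two augmented views of the same batch: diagonal entries of the cross-correlation matrix are pushed toward 1 (the two views agree), and off-diagonal entries toward 0 (different dimensions carry different information). No negative samples, memory bank, or momentum encoder are involved.

```python
import math


def standardize(batch):
    """Normalize each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n) + 1e-8
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in batch]


def cross_correlation(za, zb):
    """Return the D x D cross-correlation matrix of two N x D embedding batches."""
    za, zb = standardize(za), standardize(zb)
    n, d = len(za), len(za[0])
    return [
        [sum(za[k][i] * zb[k][j] for k in range(n)) / n for j in range(d)]
        for i in range(d)
    ]


def decorrelation_loss(za, zb, off_diag_weight=0.005):
    """Hypothetical loss: diagonal toward 1 (consistency), off-diagonal toward 0."""
    c = cross_correlation(za, zb)
    d = len(c)
    on_diag = sum((c[i][i] - 1.0) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag


# Toy batches of four 2-D embeddings; both "views" are identical here.
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]

# The batch whose two dimensions duplicate each other is penalized more.
print(decorrelation_loss(uncorrelated, uncorrelated))
print(decorrelation_loss(redundant, redundant))
```

In a multi-grained setting, a loss of this kind could be evaluated separately on temporal, spatial, and instance-level embeddings and then summed. How USDRL actually combines the three domains is described in the paper, not here.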

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of the feature decorrelation-based self-supervised skeleton-based representation learning paradigm. This approach aims to distribute samples uniformly and consistently in the representation space. Unlike Masked Sequence Modeling, it is lightweight and requires no decoder or complex masking strategies. Additionally, it simplifies Negative-based Contrastive Learning by eliminating the need for a memory bank or an additional momentum encoder, making it more streamlined and scalable.
  • Figure 2: The proposed Unified Skeleton-based Dense Representation Learning (USDRL) framework. USDRL incorporates the Dense Spatio-Temporal Encoder (DSTE) and three domain-specific projectors. The DSTE processes skeleton sequences to derive dense representations, which are further refined through MaxPooling and concatenation to generate condensed vectors. The Multi-Grained Feature Decorrelation training loss is devised to mitigate model collapse and guarantee both intra-sample consistency and inter-sample separability.
  • Figure 3: The basic layer of the Dense Spatio-Temporal Encoder. It comprises the ConvAttn (CA) and Dense Shift Attn (DSA) modules, where the symbol $\oplus$ denotes a weighted sum.
  • Figure 4: The impact of the weight hyperparameter $\alpha$ on action recognition under the xsub evaluation of the NTU-60 dataset.
  • Figure 5: Visualizations of learned instance-level representations obtained by (a) Negative Contrastive Learning (CL), (b) Multi-Grained Feature Decorrelation (MG-FD) w/o $XC$, (c) Single-Grained FD w/ $XC$, and (d) MG-FD w/ $XC$ on the NTU-60. Nine classes from the testing set are randomly selected, and dots of the same color represent actions of the same class.
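
The captions for Figures 2 and 3 mention two simple operations: a weighted sum of two branch outputs, and max pooling that condenses dense features into per-domain vectors. The sketch below illustrates both on plain nested lists. The tensor layout `(frames, joints, channels)`, the choice of pooling axes, the convex form `weight * a + (1 - weight) * b`, and every function name are assumptions made for illustration. The captions do not specify these details, and the sketch is not taken from the released code.

```python
def weighted_sum(ca_out, dsa_out, weight):
    """Blend two branch outputs elementwise (assumed convex combination)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(ca_out, dsa_out)]


def max_pool_joints(dense):
    """(T, V, C) -> (T, C): strongest response across joints for each frame."""
    num_channels = len(dense[0][0])
    return [
        [max(joint[c] for joint in frame) for c in range(num_channels)]
        for frame in dense
    ]


def max_pool_frames(dense):
    """(T, V, C) -> (V, C): strongest response across frames for each joint."""
    num_joints, num_channels = len(dense[0]), len(dense[0][0])
    return [
        [max(frame[v][c] for frame in dense) for c in range(num_channels)]
        for v in range(num_joints)
    ]


def max_pool_all(dense):
    """(T, V, C) -> (C,): a single instance-level vector for the sequence."""
    num_channels = len(dense[0][0])
    return [
        max(joint[c] for frame in dense for joint in frame)
        for c in range(num_channels)
    ]


# Toy dense features: 2 frames, 2 joints, 2 channels.
dense = [
    [[1.0, 0.0], [2.0, 5.0]],
    [[3.0, 4.0], [0.0, 1.0]],
]

print(max_pool_joints(dense))   # [[2.0, 5.0], [3.0, 4.0]]
print(max_pool_frames(dense))   # [[3.0, 4.0], [2.0, 5.0]]
print(max_pool_all(dense))      # [3.0, 5.0]
print(weighted_sum([1.0, 2.0], [3.0, 4.0], 0.25))  # [2.5, 3.5]
```

Pooling over different axes yields vectors at different granularities from the same dense tensor, which is one plausible reading of how a single encoder can feed temporal, spatial, and instance-level projectors.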