Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi

TL;DR

The paper tackles frame-level skeleton action localization under self-supervised learning by introducing dense snippet contrastive pretraining and a plug-in U-shaped multiscale fusion module. The dense snippet objective yields temporally discriminative representations, while the U-shaped fusion restores temporal resolution during finetuning for dense predictions. Experiments on BABEL and PKUMMD demonstrate consistent improvements over strong baselines and favorable transfer from long untrimmed sequences, with ablations validating the contribution of each component. Overall, the approach effectively reduces reliance on dense frame-level labels and enhances localization performance across datasets and backbones.

Abstract

The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
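
To make the snippet discrimination idea more concrete, the following is a minimal sketch rather than the authors' implementation. It pools frame-level features into non-overlapping snippets and applies an InfoNCE-style loss in which a snippet from one augmented view is treated as positive with the snippet at the same temporal index in the other view. The function names (`pool_snippets`, `snippet_info_nce`), the average pooling, the index-based matching, the temperature value, and the random input features are all assumptions made for illustration. In particular, the paper's Dense Projection Module and similarity-based matching are not reproduced here, and in a real batch snippets from other videos would also act as negatives.

```python
import numpy as np

def pool_snippets(frames, snippet_len):
    """Average-pool (T, C) frame features into non-overlapping snippets."""
    T, C = frames.shape
    n = T // snippet_len  # trailing frames that do not fill a snippet are dropped
    return frames[: n * snippet_len].reshape(n, snippet_len, C).mean(axis=1)

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def snippet_info_nce(query, key, temperature=0.1):
    """InfoNCE over snippets: row i of `query` is positive with row i of `key`;
    every other row of `key` serves as a negative (assumed matching rule)."""
    q = l2_normalize(query)
    k = l2_normalize(key)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical frame features standing in for encoder outputs of two views.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(64, 16))
view_b = view_a + 0.05 * rng.normal(size=(64, 16))  # lightly perturbed second view

snip_a = pool_snippets(view_a, snippet_len=8)
snip_b = pool_snippets(view_b, snippet_len=8)

loss_matched = snippet_info_nce(snip_a, snip_b)
# Reversing the snippet order pairs every query with a wrong positive.
loss_shuffled = snippet_info_nce(snip_a, snip_b[::-1].copy())
```

In this toy setting the loss is lower when snippets are paired with their temporally aligned counterparts than when the pairing is scrambled, which is the behaviour a snippet-level objective is meant to reward.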

Paper Structure

This paper contains 10 sections, 8 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The basic concept of our contrastive approach. (Left) Existing methods treat an entire skeleton sequence as a single sample in contrastive learning. (Right) Our approach delves into a more fine-grained level by considering snippets.
  • Figure 2: Overall pipeline of pretraining and finetuning in the proposed design. In Stage 1, we build upon the existing video-level skeleton-based CL baselines with a Dense Projection Module to obtain snippet-level skeleton embeddings for fine-grained contrastive learning, aligning matched snippets as positives while separating negatives. In Stage 2, a U-shaped module is plugged into the pretrained skeleton encoder to restore temporal resolution while fusing intermediate features through skip connections.
  • Figure 3: Similarity-based matching.
  • Figure 4: Qualitative visualization of action predictions on BABEL. We compare the ground truth with predictions from three baselines with and without our approach. (Left) There is a brief transition between Sit and Stand-up, where the subject leans forward in preparation. (Right) The subject performs the Jog action multiple times in different directions, with turns separating each action instance.
  • Figure S1: t-SNE visualization of pretrained features on three BABEL subsets. Each point represents the pretrained feature of a single frame after downsampling, which shows how frames from similar actions across videos group together in the feature space.
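
The Stage 2 design described in the Figure 2 caption (restoring temporal resolution while fusing intermediate features through skip connections) can be illustrated with a small sketch. This is not the paper's module: the nearest-neighbour upsampling, the concatenate-then-project fusion, the ReLU, the halving of temporal length per stage, and the names `upsample_time` and `u_shaped_fuse` are assumptions chosen to keep the example short. The random arrays stand in for intermediate encoder features.

```python
import numpy as np

def upsample_time(x, factor=2):
    """Nearest-neighbour upsampling of (T, C) features along the time axis."""
    return np.repeat(x, factor, axis=0)

def u_shaped_fuse(features, weights):
    """Fuse multiscale features from coarse to fine with skip connections.

    features: list of (T_i, C) arrays ordered fine to coarse; each stage is
              assumed to halve the temporal length of the previous one.
    weights:  list of (2C, C) projection matrices, one per fusion step
              (hypothetical stand-ins for learned layers).
    """
    x = features[-1]  # start from the coarsest feature map
    for skip, w in zip(reversed(features[:-1]), weights):
        x = upsample_time(x, 2)                    # restore temporal length
        x = np.concatenate([x, skip], axis=1) @ w  # fuse with the skip feature
        x = np.maximum(x, 0.0)                     # ReLU non-linearity
    return x

# Hypothetical intermediate features at temporal lengths 32, 16 and 8.
rng = np.random.default_rng(0)
T, C = 32, 8
feats = [rng.normal(size=(T // 2**i, C)) for i in range(3)]
ws = [0.1 * rng.normal(size=(2 * C, C)) for _ in range(2)]

fused = u_shaped_fuse(feats, ws)  # one feature vector per input frame
```

The output has the same temporal length as the finest input, which is the property a frame-level localization head needs.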