Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi
TL;DR
The paper tackles frame-level skeleton action localization under self-supervised learning by introducing dense snippet contrastive pretraining and a plug-in U-shaped multiscale fusion module. The dense snippet objective yields temporally discriminative representations, while the U-shaped fusion restores temporal resolution during finetuning for dense predictions. Experiments on BABEL and PKUMMD demonstrate consistent improvements over strong baselines and favorable transfer from long untrimmed sequences, with ablations validating the contribution of each component. Overall, the approach effectively reduces reliance on dense frame-level labels and enhances localization performance across datasets and backbones.
Abstract
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
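To make the snippet discrimination idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that per-frame features are already available as plain lists of floats, that snippets are formed by average-pooling non-overlapping windows, and that the contrastive objective is a standard InfoNCE loss in which snippet i of one augmented view should match snippet i of the other view while all remaining snippets act as negatives. The function names, the pooling choice, and the temperature value are illustrative assumptions, not details taken from the abstract.

```python
import math


def split_into_snippets(frames, snippet_len):
    """Average-pool consecutive, non-overlapping windows of frame features.

    Trailing frames that do not fill a whole window are dropped (an assumption).
    """
    snippets = []
    for start in range(0, len(frames) - snippet_len + 1, snippet_len):
        chunk = frames[start:start + snippet_len]
        dim = len(chunk[0])
        snippets.append([sum(f[d] for f in chunk) / snippet_len for d in range(dim)])
    return snippets


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def snippet_info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE over snippets: view_a[i] is the query, view_b[i] its positive,
    and every other entry of view_b is a negative."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, query in enumerate(a):
        logits = [sum(x * y for x, y in zip(query, key)) / temperature for key in b]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy sequence: 8 frames with 4-dim features, each pair of frames shares a one-hot feature.
frames = [[1.0 if d == t // 2 else 0.0 for d in range(4)] for t in range(8)]
snippets = split_into_snippets(frames, snippet_len=2)

aligned_loss = snippet_info_nce(snippets, snippets)
# Shift the second view by one snippet so every "positive" is actually a different snippet.
shifted = snippets[1:] + snippets[:1]
mismatched_loss = snippet_info_nce(snippets, shifted)
print(aligned_loss < mismatched_loss)
```

In a real pretraining setup the snippets from all sequences in a batch would be pooled into the same set of keys, so that negatives come from other videos as well as from other positions in the same sequence. The toy data here only illustrates that correctly aligned snippets yield a lower loss than misaligned ones.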
