Exploring Self-supervised Skeleton-based Action Recognition in Occluded Environments
Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Yufan Chen, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen
TL;DR
The paper tackles occlusion in self-supervised skeleton-based action recognition for robotics by introducing IosPSTL, which combines a cluster-agnostic KNN imputer with Occluded Partial Spatio-Temporal Learning (OPSTL) and dataset-driven Adaptive Spatial Masking (ASM). It constructs a large occlusion benchmark on NTU-60/NTU-120 and demonstrates state-of-the-art performance under realistic occlusions, with ablations confirming the benefits of ASM and the imputer. The approach is modular and transferable to various self-supervised skeleton methods, and the authors provide code to facilitate reproducibility. Overall, IosPSTL advances robust action recognition in occluded settings, enabling more reliable perception for autonomous robots.
Abstract
To integrate action recognition into autonomous robotic systems, it is essential to address challenges such as person occlusions, a common yet often overlooked scenario in existing self-supervised skeleton-based action recognition methods. In this work, we propose IosPSTL, a simple and effective self-supervised learning framework designed to handle occlusions. IosPSTL combines a cluster-agnostic KNN imputer with an Occluded Partial Spatio-Temporal Learning (OPSTL) strategy. First, we pre-train the model on occluded skeleton sequences. Then, we introduce a cluster-agnostic KNN imputer that performs semantic grouping using k-means clustering on sequence embeddings. It imputes missing skeleton data by applying K-Nearest Neighbors in the latent space, using the representations of nearby samples to restore occluded joints. This imputation produces more complete skeleton sequences, which significantly benefits downstream self-supervised models. To further enhance learning, the OPSTL module incorporates Adaptive Spatial Masking (ASM) to make better use of intact, high-quality skeleton sequences during training. Our method achieves state-of-the-art performance on the occluded versions of the NTU-60 and NTU-120 datasets, demonstrating its robustness and effectiveness under challenging conditions. Code is available at https://github.com/cyfml/OPSTL.
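The imputation step described above (k-means grouping on sequence embeddings, then K-Nearest Neighbors in latent space to fill occluded joints) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). Several details are assumptions made here for illustration: the embeddings `z` are taken as already produced by the pre-trained encoder; neighbours are searched only within the sample's own k-means cluster; and an occluded joint is filled with the plain average of the neighbours' coordinates at that joint, counting only neighbours that observed it. The function names `kmeans` and `knn_impute` and the array layout `(N, T, V, C)` with a boolean visibility `mask` are hypothetical.

```python
import numpy as np


def kmeans(z, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster label per row of z."""
    z = np.asarray(z, dtype=float)
    rng = np.random.default_rng(seed)
    # Fancy indexing returns a copy, so updating centers never modifies z.
    centers = z[rng.choice(len(z), n_clusters, replace=False)]
    labels = np.zeros(len(z), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every embedding to every center.
        d = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(n_clusters):
            members = z[labels == c]
            if len(members):  # keep the previous center if a cluster empties
                centers[c] = members.mean(0)
    return labels


def knn_impute(x, mask, z, n_clusters=2, k=2):
    """Hypothetical KNN imputer: fill occluded joints from latent neighbours.

    x:    (N, T, V, C) joint coordinates; values at occluded joints are ignored
    mask: (N, T, V) boolean, True where the joint was observed
    z:    (N, D) sequence embeddings, assumed to come from a pre-trained encoder
    """
    mask = np.asarray(mask, dtype=bool)
    z = np.asarray(z, dtype=float)
    labels = kmeans(z, n_clusters)
    out = np.asarray(x, dtype=float).copy()
    for i in range(len(out)):
        missing = ~mask[i]
        if not missing.any():
            continue
        # Assumption: candidate donors are the other samples in the same cluster.
        cand = np.flatnonzero((labels == labels[i]) & (np.arange(len(out)) != i))
        if cand.size == 0:
            continue
        dist = np.linalg.norm(z[cand] - z[i], axis=1)
        nbrs = cand[np.argsort(dist)[:k]]
        # Average the neighbours' coordinates, counting only joints they observed.
        w = mask[nbrs][..., None].astype(float)   # (k, T, V, 1)
        num = (out[nbrs] * w).sum(0)              # (T, V, C)
        den = w.sum(0)                            # (T, V, 1)
        filled = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        # Joints that no neighbour observed keep their original values.
        fill_here = missing & (den[..., 0] > 0)
        out[i][fill_here] = filled[fill_here]
    return out


# Toy example: two well-separated groups of embeddings, one occluded joint each.
z = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
x = np.array([[[[1.0], [2.0]]], [[[1.5], [0.0]]],
              [[[10.0], [20.0]]], [[[11.0], [0.0]]]])      # (N=4, T=1, V=2, C=1)
mask = np.array([[[True, True]], [[True, False]],
                 [[True, True]], [[True, False]]])        # (N=4, T=1, V=2)
restored = knn_impute(x, mask, z, n_clusters=2, k=2)
```

In the toy example, sample 1 borrows its occluded second joint from sample 0 (its only same-cluster neighbour), and sample 3 borrows from sample 2. Restricting the search to a cluster keeps donors semantically similar. A weighted average or a different neighbour rule would be an equally plausible reading of the abstract.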
