Exploring Self-supervised Skeleton-based Action Recognition in Occluded Environments

Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Yufan Chen, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen

TL;DR

The paper tackles occlusion in self-supervised skeleton-based action recognition for robotics by introducing IosPSTL, which combines a cluster-agnostic KNN imputer with Occluded Partial Spatio-Temporal Learning (OPSTL) and dataset-driven Adaptive Spatial Masking (ASM). It constructs a large occlusion benchmark on NTU-60/NTU-120 and demonstrates state-of-the-art performance under realistic occlusions, with ablations confirming the benefits of ASM and the imputer. The approach is modular and transferable to various self-supervised skeleton methods, and the authors provide code to facilitate reproducibility. Overall, IosPSTL advances robust action recognition in occluded settings, enabling more reliable perception for autonomous robots.

Abstract

To integrate action recognition into autonomous robotic systems, it is essential to address challenges such as person occlusions, a common yet often overlooked scenario in existing self-supervised skeleton-based action recognition methods. In this work, we propose IosPSTL, a simple and effective self-supervised learning framework designed to handle occlusions. IosPSTL combines a cluster-agnostic KNN imputer with an Occluded Partial Spatio-Temporal Learning (OPSTL) strategy. First, we pre-train the model on occluded skeleton sequences. Then, we introduce a cluster-agnostic KNN imputer that performs semantic grouping using k-means clustering on sequence embeddings. It imputes missing skeleton data by applying K-Nearest Neighbors in the latent space, leveraging nearby sample representations to restore occluded joints. This imputation generates more complete skeleton sequences, which significantly benefits downstream self-supervised models. To further enhance learning, the OPSTL module incorporates Adaptive Spatial Masking (ASM) to make better use of intact, high-quality skeleton sequences during training. Our method achieves state-of-the-art performance on the occluded versions of the NTU-60 and NTU-120 datasets, demonstrating its robustness and effectiveness under challenging conditions. Code is available at https://github.com/cyfml/OPSTL.
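The imputation step described in the abstract (k-means grouping of sequence embeddings, followed by a nearest-neighbor search in the latent space) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. The function names `kmeans` and `knn_impute`, the parameters `n_clusters` and `n_neighbors`, the farthest-point initialization, the restriction of the neighbor search to the sample's own cluster, and the averaging of donor coordinates are all assumptions made for illustration. Embeddings are assumed to come from a pre-trained encoder, and each skeleton is represented as a flat list of coordinates with `None` marking occluded values.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    # Farthest-point initialisation (an assumption) keeps the sketch deterministic.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def knn_impute(embeddings, skeletons, n_clusters=2, n_neighbors=2):
    """Fill None entries of each skeleton from its nearest latent-space peers."""
    labels = kmeans(embeddings, n_clusters)
    completed = [list(s) for s in skeletons]  # leave the input untouched
    for i, skel in enumerate(skeletons):
        # Candidate donors: other samples in the same cluster, nearest first.
        peers = sorted(
            (j for j in range(len(skeletons)) if j != i and labels[j] == labels[i]),
            key=lambda j: math.dist(embeddings[i], embeddings[j]),
        )
        for d, value in enumerate(skel):
            if value is not None:
                continue
            # Keep only the closest peers that actually observed this coordinate.
            donors = [skeletons[j][d] for j in peers
                      if skeletons[j][d] is not None][:n_neighbors]
            if donors:
                completed[i][d] = sum(donors) / len(donors)
    return completed

# Toy data: two well-separated groups of embeddings, two coordinates per sample.
embeddings = [[0, 0], [0.1, 0], [0, 0.2], [10, 10], [10.1, 10], [10, 10.2]]
skeletons = [[1.0, None], [1.2, 2.0], [0.8, 4.0],
             [None, 9.0], [5.0, 9.5], [7.0, 8.5]]
completed = knn_impute(embeddings, skeletons)
print(completed[0], completed[3])
```

In this toy run, each missing coordinate is replaced by the mean of the same coordinate in the two nearest same-cluster samples. Partitioning first means the neighbor search only scans one cluster rather than the whole dataset, which is one plausible reason for the clustering stage.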

Paper Structure (17 sections, 10 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Comparison of different imputation methods. In (a), we compare random imputations (gray), our imputation results (blue), and ground-truth skeletons (red). In (b) and (c), we report linear evaluation results under the cross-subject (xsub) and cross-view (xview) settings when each imputation method is applied to three popular self-supervised action recognition methods: CrossCLR [li20213d], AimCLR [guo2021contrastive], and PSTL [zhou2023selfsupervised].
  • Figure 2: Our two-stage method for completing missing skeleton coordinates. The red portion of the input $I$ represents the missing part of the skeleton. In the first stage, the pre-training model adopts the PSTL framework, with Central Spatial Masking (CSM) replaced by ASM, to make better use of high-quality data. In the second stage, the entire dataset is completed: samples are first partitioned into smaller clusters using k-means, and the cluster-agnostic KNN imputer then finds neighboring samples to fill in the missing coordinates. Yellow points in the sample space are test-set samples, and purple points are training-set samples.