Skeleton-Based Human Action Recognition with Noisy Labels

Yi Xu, Kunyu Peng, Di Wen, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiaming Zhang, Alina Roitberg, Kailun Yang, Rainer Stiefelhagen

TL;DR

Skeleton-based action recognition is vital for robot-assisted perception, but its training labels are often noisy, and existing label-denoising methods perform only marginally on sparse skeleton data. The authors benchmark existing label-denoising approaches on skeleton-based human action recognition (SHAR) and introduce NoiseEraSAR, a framework that combines cross-training across the joint, bone, and motion modalities with global sample selection and a Cross-Modal Mixture-of-Experts to mitigate label noise. On NTU-60 with 80% symmetric noise, NoiseEraSAR achieves 74.9% on X-Sub and 79.5% on X-View, surpassing the SotP and NPC baselines and setting the state of the art under heavy label noise. The work provides a practical benchmark and a robust training paradigm for skeleton-based action recognition in real-world noisy-label settings, with source code available at https://github.com/xuyizdby/NoiseEraSAR.

Abstract

Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences are time-consuming, and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MoE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study is accessible at https://github.com/xuyizdby/NoiseEraSAR.

Paper Structure (20 sections, 14 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: An overview of our task setting. We randomly inject asymmetric label noise into the training set according to a predefined noise ratio. On the right-hand side, we compare performance on the correctly labeled test set under the Cross-Subject (X-Sub) and Cross-View (X-View) settings, where our approach performs best.
  • Figure 2: Overview of the proposed method, NoiseEraSAR. In the pre-training phase, dedicated models are trained for the joint, bone, and motion modalities using cross-training. A small clean dataset is then selected by evaluating the loss values of the pre-trained models, and it is fed into the Cross-Modal Mixture-of-Experts (CM-MoE). In the fine-tuning phase, a gate network is added to control the weight of each expert and assist the CM-MoE.
  • Figure 3: Prediction results of the three methods. Four samples from different action classes are visualized on the left. The Top-5 softmax scores are plotted on the right, with the ground truth shown in green. All predictions are generated by models trained under $80\%$ label noise in the Cross-View setting.
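
The two mechanisms named in the Figure 2 caption, loss-based selection of a small clean subset and gate-weighted combination of modality experts, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_small_loss` and `gated_fusion`, the `keep_ratio` parameter, and the toy numbers are all hypothetical. The sketch also assumes that samples with small loss are treated as clean and that the gate produces softmax weights over the three experts. In the paper the gate is a trained network, whereas here its scores are simply passed in.

```python
import math


def select_small_loss(losses, keep_ratio):
    """Return sorted indices of the keep_ratio fraction of samples with the smallest loss.

    Assumption for illustration: low per-sample loss is a proxy for a clean label.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_logits, gate_scores):
    """Combine per-expert class logits using softmax-normalized gate scores as weights."""
    weights = softmax(gate_scores)
    n_classes = len(expert_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(n_classes)
    ]


# Toy per-sample losses (hypothetical values); keep the 50% with the smallest loss.
losses = [0.2, 2.5, 0.1, 1.9, 0.4, 3.0]
clean_idx = select_small_loss(losses, keep_ratio=0.5)

# Toy logits from three experts (joint, bone, motion) over two classes.
expert_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = gated_fusion(expert_logits, gate_scores=[0.0, 0.0, 0.0])
```

With equal gate scores the fusion reduces to a plain average of the three experts, and a trained gate would instead shift weight toward the modality it finds most reliable for a given input.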