
Shap-Mix: Shapley Value Guided Mixing for Long-Tailed Skeleton Based Action Recognition

Jiahang Zhang, Lilang Lin, Jiaying Liu

TL;DR

Shap-Mix addresses long-tailed skeleton-based action recognition with a two-tier augmentation framework that combines spatial-temporal skeleton mixing (ST-Mix) with Shapley-value-guided saliency to preserve salient minority-class motion patterns. It maintains an online saliency estimate via an exponential moving average (EMA) and uses a tail-aware data synthesis distribution to improve decision boundaries for tail classes, all within end-to-end training and a balanced-softmax objective. Across NTU 60/120 and Kinetics Skeleton 400, the method achieves strong improvements on long-tailed distributions while remaining competitive on balanced data, with ablations confirming the effectiveness of both the ST-Mix design and the Shapley-value guidance. The result is a practical, backbone-agnostic augmentation approach for skeleton-based action recognition in real-world, imbalanced settings, with code publicly available for reproducibility.
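
The balanced-softmax objective mentioned above is commonly implemented by shifting each logit by the log of its class prior before the usual cross-entropy. The following is a minimal NumPy sketch of that general idea, not the authors' training code; the function name `balanced_softmax_loss` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def balanced_softmax_loss(logits, labels, class_counts):
    """Cross-entropy on logits shifted by the log class prior.

    logits: (N, K) raw scores; labels: (N,) integer class ids;
    class_counts: (K,) number of training samples per class.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    counts = np.asarray(class_counts, dtype=np.float64)

    log_prior = np.log(counts / counts.sum())
    shifted = logits + log_prior                      # broadcast over the batch
    shifted = shifted - shifted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform class counts the prior shift is a constant and the loss reduces to ordinary softmax cross-entropy; with skewed counts, tail classes need a larger raw logit to reach the same training probability, which counteracts the bias toward head classes.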

Abstract

In real-world scenarios, human actions often follow a long-tailed distribution. As a result, existing skeleton-based action recognition methods, which are mostly designed for balanced datasets, suffer a sharp performance degradation. Recently, many efforts have been devoted to image/video long-tailed learning. However, directly applying them to skeleton data can be sub-optimal because they do not consider the crucial spatial-temporal motion patterns, especially for modality-specific methodologies such as data augmentation. To this end, considering the crucial role of body parts in spatially concentrated human actions, we focus on mixing augmentations and propose a novel method, Shap-Mix, which improves long-tailed learning by mining representative motion patterns for tail categories. Specifically, we first develop an effective spatial-temporal mixing strategy for skeletons to boost representation quality. Then, we present a saliency guidance method, consisting of saliency estimation based on the Shapley value and a tail-aware mixing policy. It preserves the salient motion parts of minority classes in the mixed data, explicitly establishing the relationships between crucial body structure cues and high-level semantics. Extensive experiments on three large-scale skeleton datasets show remarkable performance improvements under both long-tailed and balanced settings. Our project is publicly available at: https://jhang2020.github.io/Projects/Shap-Mix/Shap-Mix.html.
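
To make the idea of mixing skeletons at the body-part level concrete, here is a rough sketch that pastes selected body parts from one sequence into another over a temporal window. It is an illustration under stated assumptions, not the paper's ST-Mix: the `(T, V, C)` array layout, the `BODY_PARTS` joint partition, and the function name `part_mix` are all hypothetical, and the mixing ratio follows the area-based convention used by CutMix.

```python
import numpy as np

# Hypothetical 5-part partition of a 25-joint skeleton (indices are illustrative).
BODY_PARTS = {
    "trunk": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def part_mix(x_a, x_b, parts, t_start, t_end):
    """Paste the joints of `parts` from x_b into x_a for frames [t_start, t_end).

    x_a, x_b: arrays of shape (T, V, C) = (frames, joints, coordinates).
    Returns the mixed sequence and lam, the fraction of (frame, joint)
    cells that still come from x_a.
    """
    T, V, _ = x_a.shape
    joints = sorted(j for p in parts for j in BODY_PARTS[p])
    mixed = x_a.copy()
    mixed[t_start:t_end, joints] = x_b[t_start:t_end, joints]
    replaced = (t_end - t_start) * len(joints)
    lam = 1.0 - replaced / (T * V)
    return mixed, lam

# Usage: keep sample A, but take the right arm of sample B for the first 5 frames.
x_a = np.zeros((10, 25, 3))
x_b = np.ones((10, 25, 3))
mixed, lam = part_mix(x_a, x_b, ["right_arm"], 0, 5)
# A soft label could then be formed as lam * y_a + (1 - lam) * y_b.
```

A saliency-guided policy would differ from this random variant mainly in how `parts` is chosen, for example by keeping the parts judged most salient for a tail class intact.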

Paper Structure (40 sections, 4 equations, 5 figures, 11 tables)

This paper contains 40 sections, 4 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Random mixing, e.g., Cut-Mix, treats all classes equally and causes semantic confusion, degrading long-tailed performance, especially for tail categories. In contrast, our Shap-Mix generates representative samples for tail categories to recover the underlying distribution, obtaining a better decision boundary.
  • Figure 2: A simplified illustration of Shap-Mix. We first perform online saliency estimation using Eq. (2). In this example, $r = \{\text{right leg}\}$ and $b = \{\text{trunk, right arm}\}$. For dotted joints, we use the dataset mean as the static sequence. The calculated Shapley value is used to update the Shapley value list $v^c$ by EMA. Finally, the mixed data is generated, preserving the representative motion patterns of the minority class (wave in this example).
  • Figure 3: Visualization of our saliency estimation. Each entry shows an action followed by its first, second, and third most salient part combinations. The actions are chosen from many-shot (first 2), medium-shot (3-5), and few-shot (last 3) classes.
  • Figure 4: Per-class accuracy of our method compared with the baseline. The number of samples in each class is shown by the red line.
  • Figure 5: Visualization of the Shapley value guided saliency estimation on the LT-NTU 60 dataset. The first, second, and third rows show actions from many-, medium-, and few-shot classes, respectively, with the top 5 most salient parts given. Note that the Shapley values are normalized and the average saliency is 0.05.
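
The saliency estimation steps described in the Figure 2 caption can be sketched roughly as follows: evaluate a score with and without part $r$ added to a coalition $b$, replace all other joints with the dataset mean, and fold the difference into the per-class list $v^c$ with an EMA. Eq. (2) is not reproduced in this section, so this is only a sketch of a single-sample marginal contribution; `score_fn`, `part_joints`, the function names, and the momentum value are assumptions for illustration.

```python
import numpy as np

def marginal_contribution(score_fn, x, mean_seq, part_joints, r, b):
    """Return score(b + [r]) - score(b) for one sampled coalition b.

    x, mean_seq: arrays of shape (T, V, C). part_joints[p] lists the joint
    indices of part p. Joints outside the coalition take values from mean_seq.
    score_fn is a hypothetical callable, e.g. the model's confidence for class c.
    """
    def masked(coalition):
        out = mean_seq.copy()
        for p in coalition:
            out[:, part_joints[p]] = x[:, part_joints[p]]
        return out

    return score_fn(masked(list(b) + [r])) - score_fn(masked(b))

def ema_update(v_c, r, value, momentum=0.9):
    """In-place EMA update of the per-class saliency list v_c at part index r."""
    v_c[r] = momentum * v_c[r] + (1.0 - momentum) * value
    return v_c

# Toy usage: three parts of two joints each; the score only looks at part 1.
part_joints = [[0, 1], [2, 3], [4, 5]]
x = np.ones((4, 6, 1))
mean_seq = np.zeros((4, 6, 1))
score_fn = lambda seq: float(seq[:, [2, 3]].sum())

contrib_salient = marginal_contribution(score_fn, x, mean_seq, part_joints, r=1, b=[0])
contrib_other = marginal_contribution(score_fn, x, mean_seq, part_joints, r=2, b=[0])
v_c = ema_update(np.zeros(3), 1, contrib_salient)
```

Averaging such marginal contributions over many sampled coalitions is the usual Monte Carlo route to a Shapley value; the EMA plays a similar smoothing role across training iterations while letting the estimate track the changing model.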