Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments

Aaditya Khanal, Junxiu Zhou

Abstract

The practical deployment gap -- transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation -- introduces a compound domain shift whose safety implications remain underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high out-of-distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect, with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC >= 0.91) and Mahalanobis distance provide reliable distributional detection signals, these high AUROC scores coexist with poor risk-coverage behavior when deciding which individual predictions to accept. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment.
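
The risk-coverage quantity quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual selective-classification definition: samples are accepted in order of decreasing confidence, coverage is the fraction accepted, and risk is the error rate among the accepted samples. The function names and the toy numbers are hypothetical and are not taken from the paper's code or data.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(
    confidences: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) pairs, accepting the most confident samples first."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Rank sample indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for kept, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        # Coverage: fraction accepted so far. Risk: error rate among accepted.
        curve.append((kept / len(order), errors / kept))
    return curve


def risk_at_coverage(curve: List[Tuple[float, float]], coverage: float) -> float:
    """Risk at the smallest achievable coverage that is >= the requested one."""
    for cov, risk in curve:
        if cov >= coverage:
            return risk
    return curve[-1][1]


# Toy data (assumed): the most confident predictions are all wrong,
# mimicking a model that is confidently incorrect under domain shift.
toy_conf = [0.99, 0.98, 0.97, 0.96, 0.60, 0.55, 0.50, 0.45]
toy_correct = [False, False, False, False, True, False, False, False]
toy_curve = risk_coverage_curve(toy_conf, toy_correct)
print(risk_at_coverage(toy_curve, 0.5))
```

In this toy case, abstaining on the least confident half of the samples does not lower the risk at all, which is the failure pattern the abstract describes: abstention only helps when confidence actually ranks correct predictions above incorrect ones.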

Paper Structure (25 sections, 1 equation, 7 figures, 4 tables)

Figures (7)

  • Figure 1: t-SNE visualization of Skeleton Transformer features. NTU-120 (ID) classes form well-separated clusters, while Gym2D (OOD) samples (red) collapse into the interior of NTU clusters despite being systematically misclassified. This explains why softmax confidence remains high for wrong predictions.
  • Figure 2: Energy score distributions for NTU-120 (ID, blue) and Gym2D (OOD, red). Clear distributional separation indicates reliable OOD detection, but does not resolve the risk-coverage failure (Fig. 4).
  • Figure 3: ROC curves for Mahalanobis OOD detection. The Skeleton Transformer (AUROC $= 0.902$) outperforms ST-GCN, correctly identifying out-of-distribution physical postures where confidence-based metrics fail.
  • Figure 4: Risk-Coverage curves on the Gym2D (OOD) dataset. Left: frozen gating and zero-shot baselines maintain $\approx 100\%$ risk at all coverage levels, confirming that standard UQ methods fail to identify correct OOD predictions. Right: finetuned gating achieves meaningful risk reduction with decreasing coverage, enabling graceful abstention.
  • Figure 5: Reliability diagram for the finetuned gating model on Gym2D. Calibrated confidence now tracks actual accuracy, a necessary precondition for meaningful selective classification.
  • ...and 2 more figures
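
The energy-based detection signal referred to in the Figure 2 caption can be sketched as follows. This assumes the commonly used formulation $E(x) = -T \log \sum_k \exp(z_k / T)$ over the classifier logits $z$, with higher energy treated as more likely OOD; the listing above does not state the paper's exact formulation, so treat this as an assumption. AUROC is computed here by direct pairwise comparison of OOD and ID scores. All function names and logit values are hypothetical.

```python
import math
from typing import Sequence


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """E(x) = -T * log(sum_k exp(z_k / T)); higher values suggest OOD input."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Probability that a random OOD sample scores above a random ID sample."""
    if len(id_scores) == 0 or len(ood_scores) == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))


# Toy logits (assumed): ID samples have one dominant logit; OOD samples have
# smaller logits overall, yet their largest class still clearly dominates.
id_logits = [[9.0, 0.5, 0.2], [8.0, 1.0, 0.1]]
ood_logits = [[2.0, 0.1, 0.0], [1.5, 0.3, 0.2]]
id_energy = [energy_score(z) for z in id_logits]
ood_energy = [energy_score(z) for z in ood_logits]
print(auroc(id_energy, ood_energy))
```

A high AUROC from such a score says only that OOD inputs can be separated from ID inputs as a group. It says nothing about which individual OOD predictions are correct, which is why it can coexist with the flat risk-coverage curves in Figure 4.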