Benchmarking 3D Human Pose Estimation Models under Occlusions
Filipa Lino, Carlos Santiago, Manuel Marques
TL;DR
This work systematically benchmarks nine SOTA 2D-to-3D HPE models under realistic occlusions using the BlendMimic3D dataset, highlighting how occlusion-induced noise degrades 3D reconstruction across architectures. By introducing two occlusion protocols (global noise injection on occluded points and per-keypoint occlusion analysis), the study reveals consistent distal-joint vulnerabilities and shows that diffusion-based methods are particularly sensitive to noisy 2D inputs. Occlusion-aware models offer partial robustness, yet their gains come with trade-offs in non-occluded scenarios, emphasizing the need for uncertainty-aware denoisers and joint-specific reliability. The findings underscore domain-shift challenges between controlled training data and real-world occlusion patterns, guiding future design toward more robust, occlusion-aware, and generalizable 3D HPE systems.
Abstract
Human Pose Estimation (HPE) involves detecting and localizing keypoints on the human body from visual data. In 3D HPE, occlusions, where parts of the body are not visible in the image, pose a significant challenge for accurate pose reconstruction. This paper presents a benchmark of the robustness of 3D HPE models under realistic occlusion conditions, involving combinations of occluded keypoints commonly observed in real-world scenarios. We evaluate nine state-of-the-art 2D-to-3D HPE models, spanning convolutional, transformer-based, graph-based, and diffusion-based architectures, using BlendMimic3D, a synthetic dataset with ground-truth 2D/3D annotations and occlusion labels. All models were originally trained on Human3.6M and are tested here without retraining to assess their generalization. We introduce a protocol that simulates occlusion by adding noise to 2D keypoints based on real detector behavior, and conduct both global and per-joint sensitivity analyses. Our findings reveal that all models exhibit notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Additionally, the per-joint occlusion analysis identifies consistent vulnerability in distal joints (e.g., wrists, feet) across models. Overall, this work highlights critical limitations of current 3D HPE models in handling occlusions and provides insights for improving real-world robustness.
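As a rough illustration of the occlusion protocol described above, the following Python sketch perturbs only the 2D keypoints flagged as occluded and then measures 3D error per joint. It is a minimal sketch under stated assumptions, not the paper's implementation: the isotropic Gaussian noise model, the `sigma` value, and the function names `inject_occlusion_noise`, `per_joint_error`, and `mpjpe` are all hypothetical, and the abstract does not specify how the noise is derived from detector behavior. Mean per-joint position error (MPJPE) is a common 3D HPE metric and is assumed here as the evaluation measure.

```python
import math
import random


def inject_occlusion_noise(keypoints_2d, occluded, sigma=10.0, rng=None):
    """Return a copy of the 2D keypoints with noise added to occluded joints.

    keypoints_2d: list of (x, y) pixel coordinates, one per joint.
    occluded:     list of bools, True where the joint is labelled occluded.
    sigma:        noise standard deviation in pixels (assumed value; the
                  paper derives its noise from real detector behavior).
    """
    rng = rng or random.Random()
    noisy = []
    for (x, y), is_occluded in zip(keypoints_2d, occluded):
        if is_occluded:
            # Assumption: isotropic Gaussian jitter on occluded joints only.
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
        noisy.append((x, y))
    return noisy


def per_joint_error(pred_3d, gt_3d):
    """Euclidean distance between each predicted and ground-truth 3D joint."""
    return [math.dist(p, g) for p, g in zip(pred_3d, gt_3d)]


def mpjpe(pred_3d, gt_3d):
    """Mean per-joint position error over all joints of one pose."""
    errors = per_joint_error(pred_3d, gt_3d)
    return sum(errors) / len(errors)
```

In a global analysis, the noisy keypoints would be passed to each 2D-to-3D model and the resulting MPJPE compared against the clean-input baseline. In a per-joint analysis, the `occluded` mask would flag a single joint at a time, so the change in error can be attributed to that joint.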
