Benchmarking 3D Human Pose Estimation Models under Occlusions

Filipa Lino, Carlos Santiago, Manuel Marques

TL;DR

This work systematically benchmarks nine state-of-the-art 2D-to-3D HPE models under realistic occlusions using the BlendMimic3D dataset, showing how occlusion-induced noise degrades 3D reconstruction across architectures. Using two occlusion protocols (global noise injection on occluded keypoints and per-keypoint occlusion analysis), the study reveals consistent distal-joint vulnerabilities and shows that diffusion-based methods are particularly sensitive to noisy 2D inputs. Occlusion-aware models offer partial robustness, but their gains come with trade-offs in non-occluded scenarios, which points to the need for uncertainty-aware denoisers and joint-specific reliability estimates. The findings underscore the domain shift between controlled training data and real-world occlusion patterns, and they guide future design toward more robust, occlusion-aware, and generalizable 3D HPE systems.

Abstract

Human Pose Estimation (HPE) involves detecting and localizing keypoints on the human body from visual data. In 3D HPE, occlusions, where parts of the body are not visible in the image, pose a significant challenge for accurate pose reconstruction. This paper presents a benchmark of the robustness of 3D HPE models under realistic occlusion conditions, involving combinations of occluded keypoints commonly observed in real-world scenarios. We evaluate nine state-of-the-art 2D-to-3D HPE models, spanning convolutional, transformer-based, graph-based, and diffusion-based architectures, using BlendMimic3D, a synthetic dataset with ground-truth 2D/3D annotations and occlusion labels. All models were originally trained on Human3.6M and are tested here without retraining to assess their generalization. We introduce a protocol that simulates occlusion by adding noise to 2D keypoints based on real detector behavior, and conduct both global and per-joint sensitivity analyses. Our findings reveal that all models exhibit notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Additionally, a per-joint occlusion analysis identifies consistent vulnerability in distal joints (e.g., wrists, feet) across models. Overall, this work highlights critical limitations of current 3D HPE models in handling occlusions, and provides insights for improving real-world robustness.
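To make the occlusion protocol concrete, the following is a minimal sketch of how noise could be injected into occluded 2D keypoints and how a per-joint 3D error could be computed. It is not the authors' implementation: the function names, the array layouts, the zero-mean isotropic Gaussian noise model, and the `sigma` parameter are all assumptions chosen for illustration.

```python
import numpy as np


def add_occlusion_noise(kpts_2d, occluded, sigma, rng=None):
    """Perturb only the occluded 2D keypoints with Gaussian noise.

    kpts_2d:  (T, J, 2) array of 2D keypoints over T frames and J joints.
    occluded: (T, J) boolean mask, True where a joint is labeled occluded.
    sigma:    noise standard deviation, in the same units as kpts_2d
              (hypothetical parameter, not a value from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=kpts_2d.shape)
    # Broadcasting the mask zeroes the noise on visible joints.
    return kpts_2d + noise * occluded[..., None]


def per_joint_error(pred_3d, gt_3d):
    """Mean Euclidean distance per joint, averaged over frames.

    pred_3d, gt_3d: (T, J, 3) arrays. Returns a (J,) array.
    """
    return np.linalg.norm(pred_3d - gt_3d, axis=-1).mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, J = 8, 17  # assumed: 8 frames, 17-joint skeleton
    kpts = rng.uniform(0.0, 1000.0, size=(T, J, 2))
    mask = np.zeros((T, J), dtype=bool)
    mask[:, [3, 6]] = True  # hypothetical indices for two occluded joints
    noisy = add_occlusion_noise(kpts, mask, sigma=10.0, rng=rng)
    print(np.abs(noisy - kpts).max(axis=(0, 2)))  # nonzero only at 3 and 6
```

In a per-joint sensitivity analysis of this kind, the mask would mark a single joint at a time, the noisy keypoints would be passed through each 2D-to-3D model, and `per_joint_error` would be compared against the error obtained from clean inputs.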

Paper Structure

This paper contains 28 sections, 3 equations, 70 figures, 6 tables.

Figures (70)

  • Figure 1: BlendMimic3D frames show a subject without (top left) and with occlusion (bottom left). Corresponding 3D ground-truth (top right) and noisy prediction (bottom right) illustrate occlusion impact.
  • Figure 2: Comparison of 2D human pose representations across different formats. (Left) Human3.6M (H36M) format over the original image. (Middle-Left) BlendMimic3D (BM3D) format and its version adapted to H36M. (Middle-Right) The corresponding 2D poses. (Right) The corresponding 3D poses.
  • Figure 3:
  • Figure 4:
  • Figure 6: 2D human poses derived from the ground-truth skeleton, with varying levels of Gaussian noise applied to the occluded keypoints: the nose, hands, knees, and feet.
  • ...and 65 more figures