Are Pose Estimators Ready for the Open World? STAGE: A GenAI Toolkit for Auditing 3D Human Pose Estimators

Nikita Kister, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

TL;DR

This work introduces STAGE, a GenAI-based toolkit for auditing 3D human pose estimators under open-world conditions by generating paired base/attribute images that differ in a single attribute. It develops CN-3DPose, a 3D-pose-controlled image generator that combines SMPL-based depth, dense semantic encoding, and 2D skeleton cues to achieve accurate 3D pose conditioning with diverse backgrounds. The authors define the PDP metric to quantify how prone pose estimators are to degrade when attributes such as clothing, weather, or location change, and demonstrate that common state-of-the-art estimators are substantially sensitive, especially to clothing and indoor settings. STAGE enables scalable, low-cost robustness testing and data augmentation for human pose estimation (HPE). The results show that larger, more diverse training data and smarter architectures can improve resilience, and that STAGE data rivals computer-graphics (CG) datasets such as BEDLAM in training effectiveness. The work highlights critical robustness gaps in current 3D HPE systems and provides a practical, extensible path toward open-world deployment in safety-critical applications.

Abstract

For safety-critical applications, it is crucial to audit 3D human pose estimators before deployment. Will the system break down if the weather or the clothing changes? Is it robust regarding gender and age? To answer these questions and more, we need controlled studies with images that differ in a single attribute, but real benchmarks cannot provide such pairs. We thus present STAGE, a GenAI data toolkit for auditing 3D human pose estimators. For STAGE, we develop the first GenAI image creator with accurate 3D pose control and propose a novel evaluation strategy to isolate and quantify the effects of single factors such as gender, ethnicity, age, clothing, location, and weather. Enabled by STAGE, we generate a series of benchmarks to audit, for the first time, the sensitivity of popular pose estimators towards such factors. Our results show that natural variations can severely degrade pose estimator performance, raising doubts about their readiness for open-world deployment. We aim to highlight these robustness issues and establish STAGE as a benchmark to quantify them.

Paper Structure (43 sections, 3 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: How robust are 3D human pose estimators to different attributes such as gender, age, or clothing? To measure this, STAGE generates a base and an attribute image that differ in one specific control attribute: the pose is the same, and all other attributes are kept similar. In contrast to real benchmarks, this allows us to isolate the effect of one attribute on the pose estimator's performance. E.g., notice how changing the gender (row 1) or clothing (row 2) can lead to significant pose estimation failures.
  • Figure 2: Evaluating estimators with STAGE. Given base and attribute prompts along with a desired 3D pose condition, we generate a pair of test images using our CN-3DPose image generator. We encode the pose as a depth map, a dense semantic encoding and a 2D skeleton drawing. To preserve visual similarity between the images except for the target attribute, we use the same initial noise for both images. The result is a test image pair that differs mostly by our attribute of interest, suitable for controlled experiments. We run the pose estimator to be tested on both images and compare the prediction errors to measure its sensitivity towards the attribute.
  • Figure 3: Images generated via STAGE. By simply varying the text prompt, STAGE can generate images of people with different body shapes and appearances in different locations, while remaining well-aligned with the specified 3D pose. When only the person's description is changed, the background remains similar. This is important for our controlled experiments targeting single attributes.
  • Figure 4: State-of-the-art estimators can break with a simple attribute change. For every image pair, we show the predictions of the pose estimator named on the left of the pair. The base prompt is "Caucasian adult {male/female} wearing a shirt in the city during daytime."
  • Figure 5: Pose estimators are sensitive to individual attribute changes. For example, clothing texture can have a large impact on the performance of pose estimators such as PARE.
  • ...and 11 more figures
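
The paired evaluation described in the Figure 2 caption (run the estimator on the base image and on the attribute image, then compare prediction errors against the shared 3D pose) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 10 mm threshold, the "degraded fraction" summary, and the toy estimator are all hypothetical, and the summary is a stand-in rather than the paper's PDP metric, whose definition is not given in this section.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def paired_sensitivity(estimator, base_images, attr_images, gt_poses,
                       threshold_mm=10.0):
    """Compare errors on base vs. attribute images that share one 3D pose.

    `threshold_mm` and `degraded_fraction` are illustrative choices,
    not the paper's PDP definition.
    """
    base_err = np.array([mpjpe(estimator(img), gt)
                         for img, gt in zip(base_images, gt_poses)])
    attr_err = np.array([mpjpe(estimator(img), gt)
                         for img, gt in zip(attr_images, gt_poses)])
    delta = attr_err - base_err  # positive: the attribute change hurt accuracy
    return {
        "mean_base_error": float(base_err.mean()),
        "mean_attr_error": float(attr_err.mean()),
        "mean_delta": float(delta.mean()),
        "degraded_fraction": float((delta > threshold_mm).mean()),
    }


# Toy stand-in for a real pose estimator: each "image" is a tuple
# (true_pose, offset_mm) and the estimator shifts every joint along x.
def toy_estimator(image):
    pose, offset_mm = image
    return pose + np.array([offset_mm, 0.0, 0.0])


gt_poses = [np.zeros((17, 3)) for _ in range(4)]  # 17 joints, in millimetres
base_images = [(pose, 0.0) for pose in gt_poses]
attr_images = [(pose, off) for pose, off in zip(gt_poses, [0.0, 5.0, 20.0, 40.0])]

result = paired_sensitivity(toy_estimator, base_images, attr_images, gt_poses)
print(result)
```

Because both images in a pair share the same ground-truth pose, the per-pair error difference attributes any change in accuracy to the varied attribute rather than to pose difficulty, which is the point of the controlled setup.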