Are Pose Estimators Ready for the Open World? STAGE: A GenAI Toolkit for Auditing 3D Human Pose Estimators
Nikita Kister, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll
TL;DR
This work introduces STAGE, a GenAI-based toolkit for auditing 3D human pose estimators under open-world conditions. STAGE generates paired base/attribute images that differ in a single attribute, so that a change in estimator error can be attributed to that attribute. At its core is CN-3DPose, a 3D-pose-controlled image generator that combines SMPL-based depth, dense semantic encoding, and 2D skeleton cues to achieve accurate 3D pose conditioning with diverse backgrounds. The authors define the PDP metric to quantify how prone a pose estimator is to degrade when an attribute such as clothing, weather, or location changes, and show that popular state-of-the-art estimators are substantially sensitive, especially to clothing and indoor settings. STAGE enables scalable, low-cost robustness testing and data augmentation for human pose estimation (HPE). The results indicate that larger, more diverse training data and improved architectures can increase resilience, and that STAGE data rivals computer-graphics (CG) datasets such as BEDLAM in training effectiveness. The work highlights critical robustness gaps in current 3D HPE systems and offers a practical, extensible path toward safer deployment in safety-critical, open-world applications.
Abstract
For safety-critical applications, it is crucial to audit 3D human pose estimators before deployment. Will the system break down if the weather or the clothing changes? Is it robust regarding gender and age? To answer these questions and more, we need controlled studies with images that differ in a single attribute, but real benchmarks cannot provide such pairs. We thus present STAGE, a GenAI data toolkit for auditing 3D human pose estimators. For STAGE, we develop the first GenAI image creator with accurate 3D pose control and propose a novel evaluation strategy to isolate and quantify the effects of single factors such as gender, ethnicity, age, clothing, location, and weather. Enabled by STAGE, we generate a series of benchmarks to audit, for the first time, the sensitivity of popular pose estimators towards such factors. Our results show that natural variations can severely degrade pose estimator performance, raising doubts about their readiness for open-world deployment. We aim to highlight these robustness issues and establish STAGE as a benchmark to quantify them.
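The paired evaluation idea can be illustrated with a minimal sketch. It assumes each base image and its attribute-modified counterpart share the same ground-truth 3D pose, and it counts how often the error on the attribute image exceeds the error on the base image by more than a margin. The function names, the threshold, and the choice of mean per-joint position error are illustrative assumptions. This is not the PDP definition, which is not given in this section.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for one pose.

    pred, gt: equal-length sequences of (x, y, z) joint positions.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def paired_degradation_rate(base_errors, attr_errors, threshold):
    """Fraction of pairs whose error grows by more than `threshold`.

    base_errors[i] and attr_errors[i] are the errors of the same estimator
    on the i-th base image and on its attribute-modified counterpart.
    `threshold` is a hypothetical margin (same unit as the errors) that
    ignores small fluctuations.
    """
    if len(base_errors) != len(attr_errors) or not base_errors:
        raise ValueError("error lists must be non-empty and of equal length")
    degraded = sum(
        1 for b, a in zip(base_errors, attr_errors) if a - b > threshold
    )
    return degraded / len(base_errors)


if __name__ == "__main__":
    # Made-up errors in millimetres for four image pairs.
    base = [50.0, 60.0, 40.0, 55.0]
    attr = [52.0, 95.0, 41.0, 90.0]
    print(paired_degradation_rate(base, attr, threshold=20.0))
```

Because both images in a pair depict the same pose, comparing errors pair by pair rather than averaging over two unrelated image sets keeps pose difficulty from confounding the attribute effect.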
