HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation
Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
TL;DR
HiPART tackles occlusion in monocular 3D HPE by addressing a fundamental limitation of sparse 2D inputs: a two-stage generative densification yields hierarchical dense 2D poses conditioned on the original sparse input. It introduces Multi-Scale Skeletal Tokenization (MSST) to quantize dense poses into hierarchical tokens and Skeleton-aware Alignment to strengthen connections between tokens across scales, followed by Hierarchical AutoRegressive Modeling (HiARM), which generates tokens in a center-to-periphery, sparse-to-dense fashion. The approach achieves state-of-the-art single-frame 3D HPE on benchmarks such as Human3.6M and 3DPW, with robust occlusion handling and lower computational cost than many temporal methods. This densification-and-lifting pipeline offers a lightweight alternative for occluded pose estimation and can also be combined with temporal lifting models to further improve robustness and accuracy.
Abstract
Existing 2D-to-3D human pose estimation (HPE) methods attempt to handle occlusion by enriching the lifting stage with additional information such as temporal and visual cues. In this paper, we argue that these methods overlook the limitation of the sparse 2D skeleton input representation, which fundamentally restricts 2D-to-3D lifting and worsens the occlusion issue. To address this, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical dense 2D poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With the generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter count and computational complexity, and it can also complement them to further enhance performance and robustness.
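The two stages described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes plain nearest-neighbor codebook lookup, one token per scale, and a caller-supplied predictor in place of the transformer. The names `nearest_code`, `tokenize_multiscale`, `generate_sparse_to_dense`, and `predict_next`, as well as the toy joints and codebooks, are hypothetical and chosen only for illustration.

```python
import math


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_multiscale(dense_pose, scales, codebooks):
    """Quantize one dense 2D pose into one token per scale.

    dense_pose: list of (x, y) joint coordinates.
    scales: lists of joint indices, ordered from sparse to dense (assumed layout).
    codebooks: one codebook per scale; each entry is a flat coordinate vector.
    """
    tokens = []
    for joint_ids, codebook in zip(scales, codebooks):
        # Flatten the joints that belong to this scale into one vector.
        flat = [coord for j in joint_ids for coord in dense_pose[j]]
        tokens.append(nearest_code(flat, codebook))
    return tokens


def generate_sparse_to_dense(first_token, predict_next, num_scales):
    """Autoregressive loop: each new token is predicted from all coarser tokens.

    predict_next is a hypothetical stand-in for a learned model.
    """
    tokens = [first_token]
    while len(tokens) < num_scales:
        tokens.append(predict_next(tokens))
    return tokens


# Toy data: 4 joints, a sparse scale (2 joints) and a dense scale (4 joints).
dense_pose = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (1.0, 1.0)]
scales = [[0, 1], [0, 1, 2, 3]]
codebooks = [
    [[0.0, 0.0, 1.0, 0.0], [5.0, 5.0, 5.0, 5.0]],
    [[9.0] * 8, [0.0, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 1.0]],
]
tokens = tokenize_multiscale(dense_pose, scales, codebooks)

# Dummy predictor: any function of the token prefix works for the demo.
generated = generate_sparse_to_dense(2, lambda prefix: (sum(prefix) + 1) % 4, 3)
```

In a real system the codebooks would be learned and the predictor would be a trained autoregressive model conditioned on the sparse 2D pose; the sketch only shows the data flow from dense pose to per-scale tokens and from coarse tokens to finer ones.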
