HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong

TL;DR

HiPART tackles occlusion in monocular 3D HPE by addressing the fundamental limitation of sparse 2D inputs through a two-stage generative densification that yields hierarchical dense 2D poses conditioned on the original sparse input. It introduces Multi-Scale Skeletal Tokenization (MSST) to quantize dense poses into hierarchical tokens and Skeleton-aware Alignment to connect scales, followed by Hierarchical AutoRegressive Modeling (HiARM) that generates tokens in a center-to-periphery, sparse-to-dense fashion. The approach achieves state-of-the-art single-frame 3D HPE on benchmarks like Human3.6M and 3DPW, with robust occlusion handling and lower computational cost compared to many temporal methods, while remaining complementary to temporal lifting. This densification-and-lifting pipeline offers a lightweight, effective alternative for occluded pose estimation and can be integrated with temporal models to further boost robustness and accuracy.

Abstract

Existing 2D-to-3D human pose estimation (HPE) methods tackle occlusion by enriching the lifting stage with additional information such as temporal and visual cues. In this paper, we argue that these methods overlook the limitation of the sparse skeleton 2D input representation, which fundamentally restricts 2D-to-3D lifting and worsens the occlusion issue. To address this, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With the generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter count and computational complexity, and it can also complement them to further enhance performance and robustness.
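The quantization step mentioned in the abstract can be illustrated with a generic nearest-neighbour codebook lookup, the basic operation behind vector-quantized tokenizers. This is only a minimal sketch under assumptions: the function name `quantize`, the toy 2-D codebook, and the feature values are hypothetical and are not taken from the paper, whose tokenizer additionally works across multiple scales and uses Skeleton-aware Alignment.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature (N, D) and every code (K, D).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)        # (N,) discrete token ids
    return indices, codebook[indices]     # token ids and their quantized vectors

# Hypothetical toy codebook: 4 codes of dimension 2.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Hypothetical continuous features, e.g. encoder outputs for three body parts.
features = np.array([[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]])

tokens, quantized = quantize(features, codebook)
print(tokens)  # [0 1 3]
```

Each continuous feature is replaced by a discrete index, so a pose becomes a short token sequence that an autoregressive model can predict one step at a time.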

Paper Structure

This paper contains 19 sections, 11 equations, 16 figures, 7 tables, 3 algorithms.

Figures (16)

  • Figure 1: Top: Comparison of temporal-based and visual-based methods with our densification approach at the framework level. These methods enrich information in the lifting stage, while ours addresses a more fundamental issue, i.e., the sparse 2D pose input. Bottom: Comparison of parameters, GFLOPs, and MPJPE across various methods on Human3.6M. The circle size indicates GFLOPs for inference. Our method achieves the SOTA result with reduced complexity compared to the temporal-based methods.
  • Figure 2: Visualization of reconstructed 3D poses and hierarchical 2D poses on Human3.6M under occlusions. Ours outperforms DiffPose (Gong et al., 2023) due to the rich skeletal context in hierarchical poses.
  • Figure 3: Overview of the two-stage generative densification method, Hierarchical Pose AutoRegressive Transformer (HiPART). In Stage 1, the MSST module progressively quantizes the fine 2D pose into hierarchical tokens with Skeleton-aware Alignment. We omit the reconstruction of the dense 2D pose $\hat{\mathbf{x}}_d$ for simplicity. In Stage 2, we propose a HiARM scheme that generates the $(i+1)$-th pair of tokens conditioned on the tokens with indices less than $i+1$. One pair of discrete tokens contains a single sparse token and the $r$ dense tokens corresponding to the related part. Finally, the generated hierarchical 2D poses are fed to a vanilla spatial transformer for subsequent 2D-to-3D lifting.
  • Figure 4: Standard auto-regressive modeling (Top) vs Our proposed hierarchical pose auto-regressive modeling (Bottom).
  • Figure 5: Impact of the codebook dimension (top) and size (bottom) for sparse and dense tokens on Human3.6M.
  • ...and 11 more figures
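
The pair-wise token layout described in the Figure 3 caption (one sparse token grouped with its $r$ dense tokens, with each pair predicted from the pairs before it) can be sketched as plain bookkeeping. This is an illustrative sketch only: the helper names `build_pairs` and `context_before`, the token ids, and the assumption that tokens arrive already ordered from center to periphery are hypothetical, and no model is involved.

```python
from typing import List, Tuple

Pair = Tuple[int, List[int]]  # (sparse token, its r dense tokens)

def build_pairs(sparse: List[int], dense: List[int], r: int) -> List[Pair]:
    """Group each sparse token with the r dense tokens of the same body part."""
    if len(dense) != r * len(sparse):
        raise ValueError("expected exactly r dense tokens per sparse token")
    return [(s, dense[i * r:(i + 1) * r]) for i, s in enumerate(sparse)]

def context_before(pairs: List[Pair], i: int) -> List[int]:
    """Flatten all pairs with index < i: the context for predicting pair i."""
    ctx: List[int] = []
    for s, d in pairs[:i]:
        ctx.append(s)   # sparse token first
        ctx.extend(d)   # then its dense tokens
    return ctx

# Hypothetical token ids, assumed to be ordered from center to periphery.
pairs = build_pairs(sparse=[7, 3, 9], dense=[10, 11, 20, 21, 30, 31], r=2)
print(context_before(pairs, 2))  # [7, 10, 11, 3, 20, 21]
```

Under this layout a whole pair is produced per step, so the number of generation steps scales with the number of sparse tokens rather than with the total token count.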