Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, Yuan Yan Tang

TL;DR

An Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics.

Abstract

Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
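To make the frame-level pruning idea concrete, the following is a minimal sketch of correlation-based frame selection. It is not the paper's TCEP module: the redundancy score (mean cosine similarity to all other frames), the fixed `keep` budget, and the function names `cosine` and `frame_keep_mask` are assumptions chosen for illustration, and the adaptive temporal graph construction is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def frame_keep_mask(frames, keep):
    """Return a 0/1 mask over frames that keeps the `keep` least redundant ones.

    Assumption: a frame is "redundant" when its features are, on average,
    highly similar to those of every other frame in the sequence.
    """
    n = len(frames)
    redundancy = []
    for i in range(n):
        sims = [cosine(frames[i], frames[j]) for j in range(n) if j != i]
        redundancy.append(sum(sims) / len(sims) if sims else 0.0)
    # Sort frame indices from least to most redundant; ties keep temporal order.
    order = sorted(range(n), key=lambda i: redundancy[i])
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(n)]


# Three near-identical frames and one distinct frame, with a budget of two.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mask = frame_keep_mask(frames, keep=2)
```

In this toy sequence the distinct last frame is always retained, and one representative of the repeated frames fills the remaining budget. A mask of this kind is the sort of signal that a sparse attention stage could consume.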

Paper Structure

This paper contains 21 sections, 21 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: MACs and MPJPE of different methods on the Human3.6M dataset. We achieve the best performance while demonstrating highly competitive MACs results. $^{\ast}$ indicates diffusion-based methods.
  • Figure 2: The architecture of the proposed HTP. The framework is structured into two hierarchical pruning phases: (a) Frame-level Pruning and (b) Semantic-level Pruning. At each diffusion step, the input $[\boldsymbol{y}_t\oplus\boldsymbol{x}]$ is embedded and processed by the Spatial GCN (AGFormer) and Spatial MHSA. In Phase (a), TCEP first infers a sparse temporal mask $\mathbf{M}$ and pruned features, which then guide the SFT MHSA in modeling frame-to-frame dependencies on the full sequence length $F$ with reduced redundancy. In Phase (b), MGPTP physically condenses the sequence from $F$ to $f$ by aggregating representative tokens. Finally, Cross MHSA restores the original length $F$ for prediction. Note that $\mathbf{M}'$ and $\overline{\mathbf{M}}$ denote variants of $\mathbf{M}$ adapted for SFT MHSA and MGPTP, respectively. Pose embedding, Spatial MHSA, and full Temporal MHSA are standard operations in 3D HPE.
  • Figure 3: The architecture of the MGPTP.
  • Figure 4: The sensitivity analysis of $\eta$ and $k$.
  • Figure 5: Qualitative comparisons of our HTP with previous state-of-the-art methods (D3DP, KTPFormer, and FinePose) on the Human3.6M dataset. Solid blue line: ground-truth 3D pose. Solid red line: estimated 3D pose.
  • ...and 3 more figures
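
The Figure 2 caption describes condensing the sequence from $F$ to $f$ tokens and then restoring length $F$ with cross-attention. The sketch below illustrates only that shape change, under stated assumptions: `condense` selects tokens with a 0/1 mask instead of aggregating clusters as MGPTP does, and `cross_attend` is a single-head attention with no learned projections that uses the original full-length tokens as queries. Both function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def condense(tokens, mask):
    """Keep only tokens whose mask entry is 1, shrinking length F to f."""
    return [t for t, m in zip(tokens, mask) if m]


def cross_attend(queries, memory):
    """Each of the F queries attends over the f memory tokens.

    The output has one row per query, so the sequence length returns to F
    even though keys and values come from the condensed sequence.
    """
    d = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * k[c] for w, k in zip(weights, memory)) for c in range(d)])
    return out


# F = 4 tokens of dimension 2; the mask keeps f = 2 of them.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
mask = [1, 0, 0, 1]
condensed = condense(tokens, mask)          # length f
restored = cross_attend(tokens, condensed)  # length F again
```

The cost of the attention step scales with $F \cdot f$ rather than $F^2$, which is the intuition behind computing on the condensed sequence and restoring the full length only at the end.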