FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Jinglin Xu, Yijie Guo, Yuxin Peng

TL;DR

FinePOSE introduces a diffusion-model-based denoiser for 3D human pose estimation that is guided by fine-grained, learnable prompts encoding action class, body-part granularity, and kinematic cues. The architecture comprises three blocks—Fine-grained Part-aware Prompt Learning (FPP), Fine-grained Prompt-pose Communication (FPC), and Prompt-driven Timestamp Stylization (PTS)—which collectively improve denoising quality and allow controllable, text-driven pose reconstruction from 2D keypoints. It achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP, and demonstrates strong potential in multi-human scenarios on EgoHumans with a dedicated post-integration strategy. The approach combines diffusion-based generation with vision-language prompts (via CLIP) and JPMA-based multi-hypothesis aggregation, offering a new paradigm for incorporating semantic and part-level guidance into 3D HPE. Limitations include higher computational cost and a design not specialized for fully end-to-end multi-person scenes, suggesting avenues for efficiency-focused and multi-person-optimized extensions.

Abstract

The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advances, deep learning-based methods mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) The Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) The Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) The Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3 mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
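
The abstract reports results as average MPJPE (mean per-joint position error) in millimetres. As a rough illustration of what that number measures, the sketch below computes the mean Euclidean distance between predicted and ground-truth joints. The function name `mpjpe`, the array layout `(frames, joints, 3)`, and the 17-joint skeleton are assumptions made for this example; they are not taken from the FinePOSE code, and any alignment or root-centering used by a given benchmark protocol is omitted.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., joints, 3) in the same unit (e.g. mm).
    Returns the Euclidean joint distance averaged over all joints and frames.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Per-joint Euclidean distance, then the mean over everything else.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical example: 2 frames, 17 joints, every joint off by (3, 4, 0) mm.
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
error_mm = mpjpe(pred, gt)  # each joint is 5 mm away, so the mean is 5.0
```

Under this definition, a lower value means the predicted skeleton lies closer to the ground truth on average.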

Paper Structure (15 sections, 6 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: The architecture of the proposed FinePOSE. In the diffusion process, Gaussian noise is gradually added to the ground-truth 3D poses $\mathbf{Y}_0$, generating the noisy 3D poses $\mathbf{Y}_t$ for the timestamp $t$. In the denoising process, $\mathbf{Y}_t$, $\mathbf{X}$ and $t$ are fed to fine-grained prompt-driven denoiser $\mathcal{D}$ to reconstruct pure 3D poses $\hat{\mathbf{Y}}_0$. $\mathcal{D}$ is composed of a Fine-grained Part-aware Prompt learning (FPP) block, a Fine-grained Prompt-pose Communication (FPC) block, and a Prompt-driven Timestamp Stylization (PTS) block, where FPP provides more precise guidance for all human part movements, FPC establishes fine-grained communications between learnable prompts and poses for enhancing the denoising capability, and PTS integrates learned prompt embedding and current timestamp for refining the prediction at each noise level.
  • Figure 2: Qualitative comparisons of our FinePOSE with MotionBERT (Zhu et al., 2023) and D3DP (Shan et al., 2023) on Human3.6M. The gray skeleton is the ground-truth 3D pose. The blue skeleton represents the prediction of the human left part, and the orange indicates the human right part. The red dashed line marks the incorrect regions of the compared methods, and the blue dashed line indicates the counterparts of FinePOSE.
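
The diffusion process described in the Figure 1 caption, where Gaussian noise is gradually added to the ground-truth poses $\mathbf{Y}_0$ to obtain $\mathbf{Y}_t$, can be sketched with the standard closed-form noising step $\mathbf{Y}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{Y}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. This is a minimal sketch of a generic DDPM-style forward process, not the authors' implementation: the linear noise schedule, its endpoints, the step count, and the names `make_alpha_bar` and `add_noise` are all assumptions chosen for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(y0, t, alpha_bar, rng):
    """Sample Y_t from Y_0 in one step; returns the noisy poses and the noise used."""
    eps = rng.standard_normal(y0.shape)
    a = alpha_bar[t]
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps
    return y_t, eps

# Hypothetical toy input: 4 frames of a 17-joint skeleton in 3D.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=100)
y0 = rng.standard_normal((4, 17, 3))
y_t, eps = add_noise(y0, t=50, alpha_bar=alpha_bar, rng=rng)
```

In the setup the caption describes, a denoiser would then receive `y_t`, the 2D keypoints $\mathbf{X}$, and the timestamp `t`, and be trained to reconstruct $\hat{\mathbf{Y}}_0$; that network is not modelled here.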