FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models
Jinglin Xu, Yijie Guo, Yuxin Peng
TL;DR
FinePOSE introduces a diffusion-model-based denoiser for 3D human pose estimation that is guided by fine-grained, learnable prompts encoding action class, body-part granularity, and kinematic cues. The architecture comprises three blocks: Fine-grained Part-aware Prompt Learning (FPP), Fine-grained Prompt-pose Communication (FPC), and Prompt-driven Timestamp Stylization (PTS). Together they improve denoising quality and allow controllable, text-driven pose reconstruction from 2D keypoints. FinePOSE achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP, and reaches 34.3 mm average MPJPE in multi-human scenarios on EgoHumans using a dedicated post-integration strategy. The approach combines diffusion-based generation with vision-language prompts (via CLIP) and JPMA-based multi-hypothesis aggregation, offering a way to incorporate semantic and part-level guidance into 3D HPE. Limitations include higher computational cost and a design that is not specialized for fully end-to-end multi-person scenes, which suggests efficiency-focused and multi-person-optimized extensions as future work.
Abstract
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advances in deep learning-based methods, most of them overlook the possibility of coupling accessible texts with naturally feasible knowledge of humans, and so miss valuable implicit supervision that could guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting the fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks that enhance the reverse process of the diffusion model: (1) The Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts by coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) The Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between the learned part-aware prompts and poses to improve the denoising quality. (3) The Prompt-driven Timestamp Stylization (PTS) block integrates the learned prompt embedding with temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation, where it achieves 34.3 mm average MPJPE on the EgoHumans dataset, demonstrating its potential to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
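The abstract reports results as MPJPE (mean per-joint position error), the average Euclidean distance between predicted and ground-truth 3D joints. The following is a minimal sketch of that metric only, not code from the FinePOSE repository: the function name `mpjpe`, the list-of-tuples pose format, and the toy two-joint pose are all assumptions made for illustration. Benchmarks often align the root joint before measuring; that step is omitted here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error for one pose.

    pred and target are equal-length sequences of (x, y, z) joint
    coordinates in the same unit (e.g. millimetres). This layout is an
    illustrative assumption, not the FinePOSE data format.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    # Sum the Euclidean distance of each joint pair, then average over joints.
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


# Toy pose with two joints: the first matches exactly, the second is 5 units off.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
example_error = mpjpe(pred, gt)  # (0 + 5) / 2 = 2.5
```

A dataset-level figure such as the 34.3 mm quoted above would be obtained by averaging this per-pose value over all evaluated frames and subjects, under whatever alignment protocol the benchmark prescribes.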
