DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation

Jerrin Bright, Yuhao Chen, John S. Zelek

TL;DR

DreamPose3D tackles temporal coherence and intent understanding in monocular 3D human pose estimation (HPE) by casting pose estimation as an intent-conditioned diffusion process that hallucinates temporally coherent pose sequences. It learns action prompts from 2D motion via Action Prompt Learning (APL) and uses a Semantic Prompt-driven Denoiser (SPD) with a kinship-aware attention mechanism, complemented by a Hallucinative Pose Decoder (HPD) that enforces smooth motion across frames. The method achieves state-of-the-art results on Human3.6M, MPI-INF-3DHP, and MLBPitchDB, and ablations confirm the critical roles of intent guidance, the temporal hallucinator, and joint-affinity modeling. The approach remains robust under noisy inputs and demonstrates the value of integrating high-level intent with temporally structured diffusion for 3D pose estimation, with potential impact on AR/VR, sports analytics, and embodied AI applications.
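To make "intent-conditioned diffusion" concrete, the sketch below shows the standard DDPM forward noising step applied to a 3D pose sequence, followed by a toy denoiser that receives the noisy poses together with the 2D input and a prompt embedding. Everything here is an illustrative assumption rather than the authors' implementation: the linear noise schedule, the function names `make_alpha_bar`, `add_noise`, and `toy_conditional_denoiser`, the tensor shapes, and the single linear layer standing in for a learned network.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(pose_3d, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) for a clean 3D pose sequence x_0."""
    eps = rng.standard_normal(pose_3d.shape)
    x_t = np.sqrt(alpha_bar[t]) * pose_3d + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def toy_conditional_denoiser(x_t, pose_2d, prompt_emb, weights):
    """Hypothetical stand-in for a learned denoiser.

    Concatenates, per joint, the noisy 3D pose, the 2D pose, and a
    sequence-level prompt embedding, then applies one linear map.
    """
    frames, joints, _ = x_t.shape
    prompt = np.broadcast_to(prompt_emb, (frames, joints, prompt_emb.shape[-1]))
    features = np.concatenate([x_t, pose_2d, prompt], axis=-1)
    return features @ weights  # shape: (frames, joints, 3)

# Illustrative sizes only: 8 frames, 17 joints, 4-dimensional prompt.
rng = np.random.default_rng(0)
frames, joints, prompt_dim = 8, 17, 4
x0 = rng.standard_normal((frames, joints, 3))        # clean 3D poses
pose_2d = rng.standard_normal((frames, joints, 2))   # 2D input sequence
prompt_emb = rng.standard_normal(prompt_dim)         # action/intent prompt
weights = 0.1 * rng.standard_normal((3 + 2 + prompt_dim, 3))

alpha_bar = make_alpha_bar()
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
x0_hat = toy_conditional_denoiser(x_t, pose_2d, prompt_emb, weights)
print(x0_hat.shape)  # (8, 17, 3)
```

The point of the sketch is only the data flow: the prompt embedding is an extra input to the denoiser at every step, so the same noisy pose can be resolved differently depending on the inferred action.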

Abstract

Accurate 3D human pose estimation remains a critical yet unresolved challenge, requiring both temporal coherence across frames and fine-grained modeling of joint relationships. However, most existing methods rely solely on geometric cues and predict each 3D pose independently, which limits their ability to resolve ambiguous motions and generalize to real-world scenarios. Inspired by how humans understand and anticipate motion, we introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation. DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent. To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism. Finally, a hallucinative pose decoder predicts temporally coherent 3D pose sequences during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception. Extensive experiments on the Human3.6M and MPI-INF-3DHP benchmarks demonstrate state-of-the-art performance across all metrics. To further validate DreamPose3D's robustness, we tested it on a broadcast baseball dataset, where it demonstrated strong performance despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.
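One way to picture "kinematic joint affinity in the attention mechanism" is as a bias on the attention logits that favors joints close to each other on the skeleton. The abstract does not say how the affinity is computed or injected, so the sketch below is an assumption throughout: it uses hop distance along bones as the affinity signal, subtracts it from scaled dot-product logits, and runs on a hypothetical five-joint skeleton. The names `hop_distances`, `affinity_biased_attention`, and `strength` are illustrative.

```python
import numpy as np

def hop_distances(num_joints, bones):
    """All-pairs shortest path length, counted in bones (Floyd-Warshall)."""
    dist = np.full((num_joints, num_joints), np.inf)
    np.fill_diagonal(dist, 0.0)
    for a, b in bones:
        dist[a, b] = dist[b, a] = 1.0
    for k in range(num_joints):
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

def affinity_biased_attention(q, k, v, dist, strength=1.0):
    """Scaled dot-product attention with a penalty on skeleton distance.

    Assumes a connected skeleton, so every entry of `dist` is finite.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1]) - strength * dist
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical toy skeleton: a chain 0-1-2-3 plus a branch 0-4.
bones = [(0, 1), (1, 2), (2, 3), (0, 4)]
dist = hop_distances(5, bones)

# With identical queries and keys, the weights depend only on the bias.
feat_dim = 8
q = np.zeros((5, feat_dim))
k = np.zeros((5, feat_dim))
v = np.eye(5)
out, weights = affinity_biased_attention(q, k, v, dist)
print(dist[3, 4])                      # 4.0
print(weights[0, 1] > weights[0, 3])   # True
```

With the content term zeroed out, joint 0 attends more to its direct neighbor (joint 1) than to the far end of the chain (joint 3), which is the qualitative behavior a kinematic prior is meant to encourage. A learned model could replace the fixed hop distance with a trainable affinity matrix.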

Paper Structure

This paper contains 29 sections, 9 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: High-level overview of DreamPose3D framework.
  • Figure 2: Illustration of intent-driven motion perception.
  • Figure 4: Overview of DreamPose3D model architecture. Given an input 2D pose sequence $\textbf{X}$, DreamPose3D predicts temporally coherent 3D poses through the SPD block, which performs diffusion-based denoising. SPD is conditioned on a context embedding, inferred from the 2D sequence via the APL block. The HPD block complements SPD by generating auxiliary 3D hallucinatory poses to reinforce temporal consistency across frames.
  • Figure 5: Comparison of trajectories on the Human3.6M dataset for a sitting action. The black trajectory represents the ground truth, while the blue, green, and red trajectories correspond to the predictions from FinePOSE, KTPFormer, and DreamPose3D, respectively.
  • Figure 6: Comparison of the attention maps of DreamPose3D and FinePOSE. The x-axis corresponds to queries and the y-axis to predicted outputs. Lighter color indicates stronger attention.
  • ...and 7 more figures