Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior

Calvin-Khang Ta, Arindam Dutta, Rohit Kundu, Rohit Lal, Hannah Dela Cruz, Dripta S. Raychaudhuri, Amit Roy-Chowdhury

TL;DR

MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters, offering powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text.

Abstract

The Skinned Multi-Person Linear (SMPL) model plays a crucial role in 3D human pose estimation, providing a streamlined yet effective representation of the human body. However, ensuring the validity of SMPL configurations during tasks such as human mesh regression remains a significant challenge, highlighting the need for a robust human pose prior capable of discerning realistic human poses. To address this, we introduce MOPED (Multi-mOdal PosE Diffuser). MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters. Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text. This capability enhances the applicability of our approach by incorporating additional context often overlooked in traditional pose priors. Extensive experiments across three distinct tasks (pose estimation, pose denoising, and pose completion) demonstrate that our multi-modal diffusion model-based prior significantly outperforms existing methods. These results indicate that our model captures a broader spectrum of plausible human poses.

Paper Structure

This paper contains 13 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: MOPED is a flexible model that has learned to accurately represent realistic poses and condition itself on multiple modalities. It is capable of being deployed across a variety of tasks. In this work we specifically showcase results across 3D human mesh estimation, pose generation, and pose completion.
  • Figure 2: MOPED is a transformer-based model designed specifically to model the intra-joint relationship through Pose-Oriented Self-Attention [li2023pose]. Given an input pose, $z_t$, we pass it through our model $\mathcal{G}$, which consists of blocks containing a Pose-Oriented Self-Attention layer [li2023pose] and a cross-attention mechanism [vaswani2017attention] that allows the context, $c$, to be incorporated into the model. This gives us the estimated pose, $\hat{z}_0$. MOPED is flexible in its input: it can be conditioned on multiple modalities (images and/or natural language) or sampled unconditionally.
  • Figure 3: On the left-hand side, MOPED generates realistic poses with significant diversity both unconditionally (top row) and conditionally (bottom row). In contrast, DPoser tends to produce realistic poses with much more limited diversity. NRDF exhibits much greater diversity but suffers from significantly less realism. Additional examples can be found in the Supplemental Material.
  • Figure 4: Compared to other pose priors, MOPED is able to effectively leverage the HMR2.0 initialization. On the top row we see that MOPED results in far better fitting than the other pose priors. On the bottom row we see that MOPED results in improved alignment with the right leg while maintaining the right arm alignment. Additional results with larger images are in the Supplemental Material.
  • Figure 5: MOPED is capable of more realistic and diverse pose completion from sparse observations. When the arm is occluded, NRDF tends to produce the same results. Meanwhile, when only the end effectors are visible, DPoser tends to produce poses that look unnatural. MOPED, however, is able to handle all three cases and produce realistic poses. Additional results with larger images and text-conditioned pose completion are in the Supplemental Material.
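
The block structure described in the Figure 2 caption (self-attention over joint tokens, followed by cross-attention to the context $c$) can be illustrated with a dependency-free toy sketch. This is not the authors' implementation: plain scaled dot-product self-attention stands in for Pose-Oriented Self-Attention, there are no learned projections, and the names `attention`, `pose_block`, `NUM_JOINTS`, `TOKEN_DIM`, and `NUM_CONTEXT` are hypothetical, as are the toy sizes.

```python
import math
import random

NUM_JOINTS = 4    # hypothetical toy size, not the SMPL joint count
TOKEN_DIM = 6     # hypothetical per-joint feature size
NUM_CONTEXT = 3   # hypothetical number of context tokens (image/text features)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors (no learned weights)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def residual_add(x, y):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def pose_block(joint_tokens, context_tokens=None):
    """One toy block: joint self-attention, then optional cross-attention."""
    # Self-attention: every joint token attends to every joint token.
    # (Stand-in for Pose-Oriented Self-Attention, which this sketch omits.)
    h = residual_add(joint_tokens,
                     attention(joint_tokens, joint_tokens, joint_tokens))
    # Cross-attention: joint tokens query the context; skipped when the
    # model is sampled unconditionally (no context given).
    if context_tokens is not None:
        h = residual_add(h, attention(h, context_tokens, context_tokens))
    return h


random.seed(0)
z_t = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_JOINTS)]
c = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_CONTEXT)]

unconditional = pose_block(z_t)        # no context
conditional = pose_block(z_t, c)       # context injected via cross-attention
print(len(conditional), len(conditional[0]))
```

In this sketch the same block serves both the conditional and unconditional cases; the only difference is whether the cross-attention step runs, which mirrors the caption's statement that the model can be conditioned on context or sampled without it.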