Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains

Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy

TL;DR

Diffusion-HPC addresses the gap where text-conditioned diffusion models produce unrealistic human anatomies, hindering 3D pose understanding. It injects SMPL-based pose priors into the diffusion process to generate photo-realistic human figures paired with ground-truth 3D meshes, enabling synthetic data for few-shot HMR adaptation. The approach yields improvements in HMR metrics (MPJPE, PA-MPJPE) on challenging sports domains and delivers higher-quality, pose-realistic images for both text- and pose-conditioned generation, outperforming several baselines. This training-free method enhances the utility of large diffusion models for 3D human perception tasks by supplying scalable, labeled synthetic data without domain-specific finetuning.
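The two HMR metrics named above have standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE is the same error after a similarity (Procrustes) alignment of the prediction to the ground truth. The following is a minimal NumPy sketch of those definitions, not the authors' evaluation code; the function names are illustrative, and joints are assumed to be arrays of shape (J, 3) in the same units.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3) with scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # The optimal rotation comes from the SVD of the 3x3 cross-covariance.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.diag([1.0, 1.0, np.sign(np.linalg.det(u @ vt))])
    rot = u @ d @ vt
    # Least-squares optimal isotropic scale for the chosen rotation.
    scale = (s * np.diag(d)).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because PA-MPJPE removes global scale, rotation, and translation, it isolates errors in articulated pose, while MPJPE also penalizes global misplacement.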

Abstract

Recent text-to-image generative models have exhibited remarkable abilities in generating high-fidelity and photo-realistic images. However, despite the visually impressive results, these models often struggle to preserve plausible human structure in the generations. For this reason, while generative models have shown promising results in aiding downstream image recognition tasks by generating large volumes of synthetic data, they are not suitable for improving downstream human pose perception and understanding. In this work, we propose a Diffusion model with Human Pose Correction (Diffusion-HPC), a text-conditioned method that generates photo-realistic images with plausibly posed humans by injecting prior knowledge about human body structure. Our generated images are accompanied by 3D meshes that serve as ground truths for improving Human Mesh Recovery tasks, where a shortage of 3D training data has long been an issue. Furthermore, we show that Diffusion-HPC effectively improves the realism of human generations under varying conditioning strategies.

Paper Structure (16 sections, 4 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: We propose Diffusion model with Human Pose Correction (Diffusion-HPC), a synthetic image generation strategy that pairs generated images with ground-truth meshes to improve the performance of Human Mesh Recovery (HMR) models on domains with challenging poses and/or limited data. Diffusion-HPC is a text-conditioned method that addresses the implausibility of human generations from Stable Diffusion [Rombach_2022_CVPR], a large pre-trained text-conditioned generative model, while preserving the inherent flexibility of such models.
  • Figure 2: Overview of Diffusion-HPC. The generation process can be broken down into 3 steps. Step 1: Obtaining image latents $z$ from the initial generation $\mathcal{I}$ of a pre-trained text-to-image model (i.e. Stable Diffusion [Rombach_2022_CVPR]) and injecting noise. Step 2: Estimating the human body mesh $\mathcal{M} (\theta, \beta)$ from $\mathcal{I}$. If the pose is challenging according to a pose prior (i.e. VPoser [SMPL-X:2019]), the mesh's depth map $d_{fg}$ is rendered and occlusions are introduced via object masks obtained from a segmentation model. Step 3: Using the latents $z$, the foreground depths, and the text embeddings $t$ as guidance for the final generation $\mathcal{I^*}$.
  • Figure 3: Qualitative HMR results on SMART and Ski-Pose datasets. Finetuning with data from Diffusion-HPC (rightmost) helps HMR models learn novel poses from challenging domains.
  • Figure 4: Comparison with Stable Diffusion [Rombach_2022_CVPR] on text-conditioned generations. Red arrows point out implausible body parts in Stable Diffusion generations. To show a spectrum of varying pose difficulty levels, we present generations from the 5%, 50%, 95% quantiles (i.e. from easy to hard) in terms of VPoser score. Rendered depths are included to show correct pose guidance.
  • Figure S5: Qualitative comparisons to [brooks2022hallucinating] (input 2D keypoints are overlaid on the bottom left). Our generations are conditioned on text (T), real images (R), and in-domain (D).
  • ...and 5 more figures
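
The three steps in the Figure 2 caption can be sketched as control flow. This is a rough illustration under stated assumptions, not the authors' implementation: every model is passed in as a hypothetical callable, the noise injection uses the standard forward-diffusion form, the pose score is assumed to grow with difficulty, occluded pixels are assumed to be zeroed in the depth map, and returning the initial image when the pose is not challenging is also an assumption, since the caption does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Components:
    """Hypothetical stand-ins for the models named in the Figure 2 caption."""
    generate: Callable         # text -> initial image I
    encode: Callable           # image I -> latents z
    estimate_mesh: Callable    # image I -> (theta, beta)
    pose_score: Callable       # theta -> difficulty (assumed: higher = harder)
    render_depth: Callable     # (theta, beta) -> foreground depth map d_fg
    object_mask: Callable      # image I -> boolean mask of occluding objects
    guided_generate: Callable  # (noisy z, d_fg, text) -> final image I*


def add_noise(z, alpha_bar, rng):
    """Standard forward-diffusion noising (assumed form of 'injecting noise')."""
    eps = rng.standard_normal(z.shape)
    return np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * eps


def diffusion_hpc_sketch(text, c, alpha_bar=0.5, threshold=0.0, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: initial generation, its latents, and noise injection.
    image = c.generate(text)
    z_noisy = add_noise(c.encode(image), alpha_bar, rng)

    # Step 2: estimate the body mesh and check the pose prior.
    theta, beta = c.estimate_mesh(image)
    if c.pose_score(theta) <= threshold:
        # Assumption: a pose that is not challenging is left uncorrected.
        return image, (theta, beta)
    d_fg = np.array(c.render_depth(theta, beta), dtype=float)
    # Assumption: occluders are modeled by zeroing the depth under the mask.
    d_fg[c.object_mask(image)] = 0.0

    # Step 3: final generation guided by latents, depth, and text.
    return c.guided_generate(z_noisy, d_fg, text), (theta, beta)
```

Either branch returns the image together with the estimated mesh parameters, which mirrors the idea that each generated image comes paired with a 3D mesh usable as a training label.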