Diffusion Models are Efficient Data Generators for Human Mesh Recovery

Yongtao Ge, Wenjia Wang, Yongfan Chen, Fanzhou Wang, Lei Yang, Hao Chen, Chunhua Shen

TL;DR

This work tackles the data scarcity challenge in 3D human pose and shape estimation by introducing HumanWild, a diffusion-model–driven pipeline that generates in-the-wild human images with aligned 3D annotations. It conditions image synthesis on SMPL-X–derived maps (keypoints, depth, normals) via a fine-tuned multi-condition ControlNet, using a large, curated dataset of real and CG data and a denoising/refinement stage (SAM, RTMPose, SMPLify) to ensure alignment. Experiments show that this diffusion-generated data complements CG-rendered data, improving HPS performance across benchmarks like 3DPW, AGORA, and RICH, and enabling scalable in-the-wild training. The approach holds promise for expanding high-quality 3D human datasets without mocap or heavy rendering pipelines, with potential extensions to multi-human scenes and other 3D perception tasks.

Abstract

Despite remarkable progress on 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by a rendering engine using computer graphics (CG). Both categories of datasets fall short of providing sufficient human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. Specifically, we first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions, depth maps, and surface normal maps. To generate a wide variety of human images with initial labels, we train a customized, multi-condition ControlNet model. The key to this process is using a 3D parametric model, e.g., SMPL-X, to create various condition inputs easily. Our data generation pipeline is both flexible and customizable, making it adaptable to multiple real-world tasks, such as human interaction in complex scenes and humans captured by wide-angle lenses. By relying solely on generative models, we can produce large-scale, in-the-wild human images with high-quality annotations, significantly reducing the need for manual image collection and annotation. The generated dataset encompasses a wide range of viewpoints, environments, and human identities, ensuring its versatility across different scenarios. We hope that our work can pave the way for scaling up 3D human recovery to in-the-wild scenes.
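
As a rough illustration of how a 3D parametric body model can yield a 2D condition input, the sketch below projects a few 3D joints through a pinhole camera and renders them as a Gaussian keypoint heatmap. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the joint coordinates, the camera intrinsics, and the Gaussian rendering are all hypothetical choices made for illustration, and no SMPL-X model is actually loaded.

```python
import math


def project_points(joints_3d, focal, cx, cy):
    """Pinhole projection of camera-space 3D points (x, y, z) to pixel coordinates."""
    pixels = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("points must lie in front of the camera (z > 0)")
        pixels.append((focal * x / z + cx, focal * y / z + cy))
    return pixels


def keypoint_heatmap(pixels, width, height, sigma=2.0):
    """Render one Gaussian blob per keypoint, blended by maximum, into a height x width grid."""
    heatmap = [[0.0] * width for _ in range(height)]
    for u, v in pixels:
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                if value > heatmap[row][col]:
                    heatmap[row][col] = value
    return heatmap


# Hypothetical joints in metres (camera space); a real pipeline would take
# them from a posed body model rather than hard-coding them.
joints = [(0.0, 0.0, 2.0), (0.5, -0.5, 2.0)]
pixels = project_points(joints, focal=32.0, cx=16.0, cy=16.0)
heatmap = keypoint_heatmap(pixels, width=32, height=32)
```

Because the heatmap is derived from the same 3D joints that serve as the annotation, the condition image and the label are aligned by construction. Depth and surface normal maps would come from rasterizing the full mesh, which this sketch does not attempt.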

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Dataset appearance distributions of synthesized datasets and in-the-wild real-world datasets.
  • Figure 2: The overall pipeline of the proposed controllable data generation. Our ControlNet can be conditioned on fine-grained keypoint maps, depth maps, surface normal maps, and structured text prompts. Our text prompt includes human appearance, pose, and an indoor or outdoor scene type, depending on whether background normals are present.
  • Figure 3: Visualization of our rendered human dataset with diverse human identities and in-the-wild scenes. Each data sample contains an RGB image (left), a depth map (middle), and a surface normal map (right).
  • Figure 4: Visualization of HPS estimation. For each data sample, the left image is the original, the middle is the result of a model trained on the BEDLAM dataset, and the right is the result of a model fine-tuned on our HumanWild dataset.
  • Figure 5: Visualization of human images generated by our multi-condition generation pipeline with diverse poses and scenes. For each data sample, the first column is generated by the ControlNet before revision. The second and third columns are generated by our revised model. The fourth, fifth, and sixth columns are the keypoint heatmap, depth map, and surface normal map rendered from the SMPL-X model.
  • ...and 8 more figures