PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht

Abstract

Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.

Figures (12)

  • Figure 1: PoseDreamer Pipeline: Our method builds a high-quality synthetic human dataset using aligned controllable diffusion, hard sample mining, and filtering. It produces photorealistic images with precise spatial control and diverse scenarios for robust training. With fixed SMPL-X [smplx] parameters, varying only the text caption yields diverse, controllable changes in background, clothing, and environment.
  • Figure 2: Examples from the PoseDreamer Dataset: High-quality synthetic human samples generated using our pipeline. The examples highlight photorealistic appearance, precise spatial control, and diverse, challenging scenarios that support robust model training.
  • Figure 3: Data Generation: Our pipeline begins with SMPL-X [smplx] parameter and caption generation from multiple sources, followed by curriculum-based hard sample mining to select challenging poses, and concludes with DPO-aligned controllable image generation and comprehensive quality filtering.
  • Figure 4: DPO Alignment Effectiveness: Comparison between the baseline control model (top row) and the DPO-aligned model (bottom row) using identical SMPL-X parameters and captions. The aligned model generates more consistent poses with closer adherence to the ground-truth 3D parameters, demonstrating improved control precision and reduced pose deviations.
  • Figure 5: OKS Scores: Comparison between images with low and high OKS values. Only images with an OKS score greater than 0.8 pass the filtering stage (a minimal OKS sketch follows this list).
  • ...and 7 more figures
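
For concreteness, the OKS-based filtering referenced in Figure 5 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes COCO-style 2D keypoints and the standard COCO per-keypoint constants, since the exact keypoint detector and constants used in the pipeline are not specified here, and the helper names (`oks`, `passes_oks_filter`) are hypothetical.

```python
import numpy as np

# Standard COCO per-keypoint constants (sigmas). The paper does not state which
# keypoint set or constants its OKS filter uses, so these are an assumption.
COCO_SIGMAS = np.array([
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
])


def oks(pred_kpts, gt_kpts, visibility, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between predicted and reference 2D keypoints.

    pred_kpts, gt_kpts : (K, 2) arrays of pixel coordinates
    visibility         : (K,) array, nonzero for labelled keypoints
    area               : person area in pixels^2, used for scale normalisation
    """
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=-1)   # squared keypoint distances
    var = (2.0 * sigmas) ** 2                           # per-keypoint falloff
    e = d2 / (2.0 * var * (area + np.spacing(1)))       # scale-normalised error
    visible = visibility > 0
    if not np.any(visible):
        return 0.0
    return float(np.mean(np.exp(-e[visible])))


def passes_oks_filter(pred_kpts, gt_kpts, visibility, area, threshold=0.8):
    """Keep a generated image only if its pose matches the conditioning pose
    with OKS above the threshold (0.8, as stated in the Figure 5 caption)."""
    return oks(pred_kpts, gt_kpts, visibility, area) > threshold
```

In such a setup, `pred_kpts` would presumably come from a 2D pose detector run on the generated image, while `gt_kpts` would be the 2D projections of the conditioning SMPL-X joints, so that a low OKS flags images whose rendered pose drifted from the 3D label.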