Toward Human Understanding with Controllable Synthesis

Hanz Cuevas-Velasquez, Priyanka Patel, Haiwen Feng, Michael Black

TL;DR

It is shown, for the first time, that generative image models can be controlled by traditional graphics methods to produce training data that increases the accuracy of HPS methods.

Abstract

Training methods to perform robust 3D human pose and shape (HPS) estimation requires diverse training images with accurate ground truth. While BEDLAM demonstrates the potential of traditional procedural graphics to generate such data, the training images are clearly synthetic. In contrast, generative image models produce highly realistic images but without ground truth. Putting these methods together seems straightforward: use a generative model with the body ground truth as the controlling signal. However, we find that the more realistic the generated images, the more they deviate from the ground truth, making them inappropriate for training and evaluation. Enhancements of realistic details, such as clothing and facial expressions, can lead to subtle yet significant deviations from the ground truth, potentially misleading the models being trained. We empirically verify that this misalignment causes the accuracy of HPS networks to decline when trained with generated images. To address this, we design a controllable synthesis method that effectively balances image realism with precise ground truth. We use this to create the Generative BEDLAM (Gen-B) dataset, which improves the realism of the existing synthetic BEDLAM dataset while preserving ground truth accuracy. We perform extensive experiments, with various noise-conditioning strategies, to evaluate the tradeoff between visual realism and HPS accuracy. We show, for the first time, that generative image models can be controlled by traditional graphics methods to produce training data that increases the accuracy of HPS methods.

Paper Structure

This paper contains 29 sections, 24 figures, 4 tables.

Figures (24)

  • Figure 1: Generative BEDLAM (Gen-B) is a dataset that takes traditionally rendered images with perfect ground truth 3D body shape and pose information and "upgrades" their realism using a generative diffusion process that remains faithful to the ground truth. Specifically, we upgrade BEDLAM, a large-scale synthetic video dataset designed to train and test algorithms on the task of 3D human pose and shape estimation. This is challenging because, although image generation methods produce realistic images, the resulting images may deviate from the ground truth, making them unusable for training or evaluation. To address this, we use metadata provided by BEDLAM to control the generative process. Depending on the noise added during the diffusion step, we can produce more realistic images that preserve the pose and shape of the person. We show, for the first time, that such a generative approach produces a training dataset that improves the accuracy of 3D human pose and shape estimation.
  • Figure 2: Gen-B pipeline applied to "head". We use the GT color image from BEDLAM and add noise for $t$ steps. The mask of the head and the embeddings are used to inpaint the region of the head. To preserve the shape and pose we use the surface normals, pose, depth, and edges as control signals.
  • Figure 3: Gen-B process. Our method takes as input a BEDLAM image and processes each synthetic body in the image. For each person, it crops around the person using the body part masks and uses the cropped region as input for our pipeline. Our pipeline uses depth, edges from depth, surface normals, and 2D poses as control signals for the multi-ControlNet network. Our pipeline sequentially processes the head, then the hair, the body, and finally the feet of the person. Once the generation is done, it continues to the next person.
  • Figure 4: Inpainting errors. When inpainting is performed on the masked body, the body shape, hands, and face change.
  • Figure 5: ControlNet-generated images (best viewed zoomed in). At first glance, the images generated with different conditioning images appear to convert BEDLAM into well-aligned photorealistic images. However, on closer inspection, they modify parts of the body (red circles), which creates a mismatch with the GT mesh data. When we combine all the control signals, we manage to enforce shape and pose consistency. We overlaid the edges of the body on the images to highlight the changes.
  • ...and 19 more figures
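
The captions for Figures 2 and 3 describe adding noise to the BEDLAM image for $t$ steps, inpainting only inside a body-part mask, and processing the head, hair, body, and feet in sequence. The following is a minimal NumPy sketch of that control flow, not the authors' implementation. It assumes the standard DDPM closed-form forward process with a linear beta schedule, and it works directly on pixels for simplicity. All function names are hypothetical, and `generate` is a stand-in callable for the multi-ControlNet denoiser (conditioned on depth, edges, surface normals, and 2D pose), which is not modeled here.

```python
import numpy as np

# Part order taken from the Figure 3 caption.
PART_ORDER = ("head", "hair", "body", "feet")

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form (standard DDPM forward step)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def composite(original, generated, mask):
    """Keep generated pixels inside the mask and original pixels outside it."""
    m = mask[..., None].astype(original.dtype)
    return m * generated + (1.0 - m) * original

def upgrade_person(image, part_masks, generate, t, alpha_bar, rng):
    """Regenerate one person part by part.

    `generate(noisy, part)` is a hypothetical placeholder for the
    conditioned denoiser; it must return an image of the same shape.
    """
    out = image.copy()
    for part in PART_ORDER:
        noisy = add_noise(out, t, alpha_bar, rng)
        out = composite(out, generate(noisy, part), part_masks[part])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))  # toy stand-in for a cropped BEDLAM image
    masks = {p: np.zeros((8, 8), dtype=bool) for p in PART_ORDER}
    masks["head"][0:2, :] = True
    masks["body"][2:6, :] = True
    dummy = lambda noisy, part: np.clip(noisy, 0.0, 1.0)  # no real denoising
    result = upgrade_person(image, masks, dummy, 200, make_alpha_bar(), rng)
    print(result.shape)
```

In this sketch a larger `t` pushes the masked region further from the rendered input, which mirrors the realism versus ground-truth-fidelity tradeoff the abstract describes; pixels outside every mask are left untouched.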