Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation

Benito Buchheim, Max Reimann, Jürgen Döllner

TL;DR

The paper proposes a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network, so that generated images are adapted to the input domain.

Abstract

We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D human parametric model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets and high-quality annotations, which can be acquired more cost-effectively through synthetic data generation than from real-world data. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model's visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network to adapt the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2D pose-based ControlNet while maintaining visual fidelity and improving stability, demonstrating its usefulness for downstream tasks such as human animation.

Paper Structure

This paper contains 23 sections, 6 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Our approach allows 3D parametric control over human pose and shape (a) in LDMs using SMPL (SMPL:2015) meshes. We train on synthetic data (b) and propose a domain adaptation technique to adapt model outputs into the original visual domain.
  • Figure 2: A pretrained ControlNet $\epsilon_{\text{SD}}$ conditioned on 2D poses (a) can generate pose-guided images in the data domain $p_{\text{SD}}$ (blue) of the Stable Diffusion model. To enable SMPL-based human shape and 3D pose control, the model is fine-tuned on a synthetic dataset (b), shifting the model outputs into the synthetic data domain $p_{\text{Syn}}$ (orange). Our approach proposes classifier-free guidance composition (c) to adapt the visual output domain to the original data domain while retaining shape and pose control.
  • Figure 3: Overview of the networks involved in our SD-control approach during one denoising timestep $t$. During fine-tuning (solid lines), the overall model output $\epsilon_{\text{Syn}}(\boldsymbol{c}_{\text{s}}, \boldsymbol{c}_{\text{o}}) = \epsilon_\theta(z_t, t, \emptyset, C_{\text{SMPL}}(\boldsymbol{c}_{\text{s}}, \boldsymbol{c}_{\text{o}}))$ is adapted to a synthetic SMPL image dataset. During inference (solid and dotted lines), a pose-conditioned guidance network ($C_{\text{SD}}$) is executed alongside $C_{\text{SMPL}}$, and a composite guidance vector is constructed from the outputs.
  • Figure 4: Varying shape parameters for fixed pose and prompt
  • Figure 5: Sampling different shape parameters. The SMPL shape (top row) is accurately reflected in the generated images of clothed people.
  • ...and 12 more figures
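
The captions of Figures 2 and 3 describe building a composite guidance vector from the outputs of the pose-conditioned network and the SMPL-conditioned network. The exact equation is not reproduced on this page, so the following is only a minimal sketch of one plausible composition, not the paper's formula. It assumes that the SMPL signal is isolated as the difference between a conditional and an unconditional prediction of the synthetically fine-tuned model, so that a shift shared by both predictions cancels. All function names, argument names, and weights are hypothetical.

```python
def compose_guidance(eps_uncond, eps_sd, eps_syn_cond, eps_syn_uncond,
                     w_sd=7.5, w_smpl=2.0):
    """Combine two guidance directions into one noise prediction.

    Hypothetical inputs (floats here; arrays or tensors of equal shape
    would work the same way through elementwise arithmetic):
      eps_uncond      unconditional prediction of the pretrained model
      eps_sd          text- and 2D-pose-conditioned prediction (C_SD branch)
      eps_syn_cond    SMPL-conditioned prediction (C_SMPL branch)
      eps_syn_uncond  same fine-tuned branch without the SMPL condition
    The weights w_sd and w_smpl are illustrative, not values from the paper.
    """
    # Standard classifier-free guidance direction in the original domain.
    sd_direction = eps_sd - eps_uncond
    # SMPL direction: any offset common to both synthetic-domain
    # predictions cancels in this difference (assumption of this sketch).
    smpl_direction = eps_syn_cond - eps_syn_uncond
    return eps_uncond + w_sd * sd_direction + w_smpl * smpl_direction


# With the SMPL weight set to zero, the sketch reduces to ordinary
# classifier-free guidance on the pose-conditioned branch.
baseline = compose_guidance(1.0, 2.0, 5.0, 4.0, w_sd=7.5, w_smpl=0.0)
combined = compose_guidance(1.0, 2.0, 5.0, 4.0, w_sd=7.5, w_smpl=2.0)
```

In this sketch, adding the same constant to both synthetic-domain predictions leaves the result unchanged, which is the sense in which the conditional information is separated from the domain shift.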