From Text to Pose to Image: Improving Diffusion Model Control and Quality

Clément Bonnet, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru

TL;DR

We introduce a text-to-pose (T2P) generative model with a new sampling algorithm, together with a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Combined, they enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models.

Abstract

In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets and Adapters. When these methods are applied to control human poses in the outputs of text-to-image diffusion models, two main challenges arise. The first is generating poses that follow a wide range of semantic text descriptions; previous methods instead searched for a pose within a dataset of (caption, pose) pairs. The second is conditioning image generation on a specified pose while maintaining both high aesthetic quality and high pose fidelity. In this article, we address both challenges by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at https://github.com/clement-bonnet/text-to-pose.
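
The retrieval approach that the abstract attributes to previous methods, searching a dataset of (caption, pose) pairs, can be illustrated with a minimal nearest-neighbour sketch. Everything specific here is an assumption made for illustration: the function name `retrieve_pose`, the use of cosine similarity over caption embeddings, the random stand-in embeddings, and the 17-keypoint pose shape. The baseline actually used in the paper may differ.

```python
import numpy as np

def retrieve_pose(query_emb, caption_embs, poses, k=1):
    """Return the k poses whose caption embeddings are closest to the query.

    query_emb:    (d,) embedding of the text prompt (hypothetical encoder output)
    caption_embs: (n, d) embeddings of the dataset captions
    poses:        (n, num_keypoints, 2) keypoints paired with each caption
    """
    # Normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = c @ q
    # Indices of the k most similar captions, best first.
    top = np.argsort(-sims)[:k]
    return poses[top], sims[top]

# Toy data standing in for a real (caption, pose) dataset.
rng = np.random.default_rng(0)
caption_embs = rng.normal(size=(5, 8))
poses = rng.uniform(size=(5, 17, 2))

# A query that is a slightly perturbed copy of caption 3.
query = caption_embs[3] + 0.01 * rng.normal(size=8)
best_poses, best_sims = retrieve_pose(query, caption_embs, poses, k=2)
```

A lookup of this kind can only return poses that already exist in the dataset, which is the limitation a generative text-to-pose model is meant to remove.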

Paper Structure

This paper contains 20 sections, 1 theorem, 1 equation, 8 figures.

Key Result

Theorem 1

Let $X$ be a random variable with probability density $p(X)$, i.e. $X \sim p(X)$. Let $T \in \mathbb{R}_+$ be a positive real "temperature" parameter. We define the "tempered distribution transform" as $X_T \sim p_T(X)$, where:

Properties -- the tempered distribution $p_T$ is related to the original one $p$ by the following properties:

Sampling scheme -- to sample from the temp
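
The statement above is truncated and its defining formula is not shown, so the following is only a sketch under a stated assumption: it uses the widely used convention $p_T \propto p^{1/T}$, which may or may not match the paper's definition. It applies that convention to a simple categorical distribution; the function names `temper` and `sample_tempered` are hypothetical and not taken from the paper's code.

```python
import numpy as np

def temper(probs, temperature):
    """Tempered categorical distribution, ASSUMING p_T is proportional to p ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    probs = np.asarray(probs, dtype=float)
    # Work in log space for numerical stability; log(0) = -inf is acceptable here.
    with np.errstate(divide="ignore"):
        logits = np.log(probs) / temperature
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_tempered(probs, temperature, size, rng):
    """Draw category indices from the tempered distribution."""
    return rng.choice(len(probs), size=size, p=temper(probs, temperature))

p = np.array([0.6, 0.3, 0.1])
cold = temper(p, 0.1)    # low temperature: mass concentrates on the mode
same = temper(p, 1.0)    # T = 1 leaves the distribution unchanged
hot = temper(p, 100.0)   # high temperature: close to uniform
samples = sample_tempered(p, 0.5, size=1000, rng=np.random.default_rng(0))
```

Under this convention, lowering the temperature trades sample diversity for samples nearer the mode, and raising it does the opposite.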

Figures (8)

  • Figure 1: Text-to-Pose transformer architecture.
  • Figure 2: CLaPP scores with 95% confidence intervals. The win-rate ratio of T2P over KNN is 78%. We use a subset of 100 (caption, pose) pairs from the COCO 2017 validation dataset.
  • Figure 3: Performance of pose-conditioned image generation for the Tencent adapter and ours. (a): Aesthetic score (ML-based). (b): Human Preference Score v2 (Wu et al., 2023). (c): Human preferences (manually annotated). Error bars represent two standard deviations. We use a subset of 100 (caption, pose) pairs from the COCO 2017 validation dataset to serve as conditions for image generation.
  • Figure 4: Text-to-pose-to-image framework.
  • Figure 5: CLaPP scores on 5 poses and corresponding captions from the COCO dataset. The scores measure the compatibility between text and poses.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 1: Tempered distribution sampling
  • Proof 1