Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Junyan Wang, Zhenhong Sun, Zhiyu Tan, Xuanbai Chen, Weihua Chen, Hao Li, Cheng Zhang, Yang Song

TL;DR

This paper proposes a human-centric alignment loss that strengthens human-related information from the textual prompts within the cross-attention maps during fine-tuning, and introduces scale-aware and step-wise constraints within the diffusion process to ensure semantic detail richness and human structural accuracy.

Abstract

Vanilla text-to-image diffusion models struggle to generate accurate human images, commonly producing imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding controls (human-centric priors such as pose or depth maps) during the image generation phase. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, based on an in-depth analysis of the cross-attention layer. Extensive experiments show that our method substantially improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts. Project page: https://hcplayercvpr2024.github.io.
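The abstract does not state the form of the human-centric alignment loss, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes the loss is a mean squared error between the spatial distribution of cross-attention on human-related tokens and a flattened prior heatmap (for example, one derived from a pose map). The names `cross_attention`, `alignment_loss`, `human_token_ids`, and `prior` are hypothetical.

```python
import math


def cross_attention(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)).

    queries: list of N image-location vectors.
    keys: list of T text-token vectors of the same dimension d.
    Returns N rows of T attention weights; each row sums to 1.
    """
    d = len(keys[0])
    rows = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def _normalize(values):
    """Scale non-negative values so they sum to 1 (left unchanged if all zero)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def alignment_loss(attn, human_token_ids, prior):
    """Assumed loss (not from the paper): MSE between two spatial distributions.

    attn: N rows of T attention weights, as returned by cross_attention.
    human_token_ids: indices of the human-related text tokens.
    prior: N non-negative values, a flattened human-centric prior heatmap.
    """
    # Mean attention that each image location pays to the human-related tokens.
    human = [
        sum(row[t] for t in human_token_ids) / len(human_token_ids)
        for row in attn
    ]
    human, target = _normalize(human), _normalize(prior)
    return sum((h - p) ** 2 for h, p in zip(human, target)) / len(target)
```

Under these assumptions the loss is zero when the attention on the human-related tokens is spatially proportional to the prior, and it grows as the two distributions diverge.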

Paper Structure (26 sections, 10 equations, 19 figures, 3 tables)

Figures (19)

  • Figure 1: Existing text-to-image models often struggle to generate human images with accurate anatomy (upper branch). We incorporate human-centric priors into the model fine-tuning stage to rectify this imperfection (bottom branch). The learned model can synthesize high-quality human images from text without requiring additional conditions at the inference stage.
  • Figure 2: Average cross-attention maps across all timesteps of a text-conditioned diffusion process. These maps contain semantic relations with the text that affect the generated image, exemplified by the inaccurate duplication of legs in the generated human figure.
  • Figure 3: Cross-attention maps for the fixed token 'yoga' across the stages of the U-Net architecture at different inference timesteps. The vertical axis represents the inference timestep when using DDIM (Song et al., 2020), while the horizontal axis corresponds to the different scale stages within the U-Net framework. The right side displays the generated image at each step.
  • Figure 4: Overview of the proposed learnable Human-centric Prior layer training in the frozen pre-trained latent diffusion model. The left part shows the process of human-centric text tokens extraction, the middle part indicates the overall process of the HcP layer plugged into the U-Net framework, and the right part shows the HcP layer training with the proposed human-centric alignment loss.
  • Figure 5: Alignment of layer-specific ResNet features with the corresponding-scale ($[64^2, 32^2, 16^2, 8^2]$) human-centric attention maps in each cross-attention layer of the U-Net architecture for the human-centric alignment loss.
  • ...and 14 more figures
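
The captions for Figures 2 and 5 mention two operations that are easy to illustrate: averaging a token's cross-attention map over timesteps, and matching a human-centric map to the $64^2$, $32^2$, $16^2$, and $8^2$ resolutions. The sketch below is an illustration only; it assumes plain element-wise averaging and average pooling, which the captions do not specify, and the function names are hypothetical.

```python
def average_over_timesteps(maps):
    """Element-wise mean of per-timestep attention maps.

    maps: list of H x W grids (lists of lists), one grid per timestep.
    Returns a single H x W grid.
    """
    steps = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[y][x] for m in maps) / steps for x in range(width)]
        for y in range(height)
    ]


def average_pool(grid, size):
    """Downsample a square grid to size x size by averaging non-overlapping blocks."""
    side = len(grid)
    if side % size != 0:
        raise ValueError("target size must divide the input side length")
    f = side // size  # block edge length
    return [
        [
            sum(grid[y * f + dy][x * f + dx] for dy in range(f) for dx in range(f))
            / (f * f)
            for x in range(size)
        ]
        for y in range(size)
    ]


def prior_pyramid(prior, sizes=(64, 32, 16, 8)):
    """Assumed helper: one pooled copy of the prior per attention resolution."""
    return {size: average_pool(prior, size) for size in sizes}
```

For example, `prior_pyramid(prior)` on a 64 x 64 map returns a dictionary keyed by 64, 32, 16, and 8, so each cross-attention layer can be compared against a map of its own resolution.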