Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger

TL;DR

Stable-Pose is a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models and leverages the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons.

Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to place increased emphasis on the pose region, thereby improving the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an improvement of around 13% over the established technique ControlNet. The project link and code are available at https://github.com/ai-med/StablePose.
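
The abstract says the loss places increased emphasis on the pose region. The following is a minimal sketch of one way such a weighting could look, not the authors' implementation: it assumes a per-element squared error, a binary pose mask, and a scalar strength `alpha`. The function name `pose_weighted_mse` and the parameter `alpha` are hypothetical.

```python
def pose_weighted_mse(pred, target, pose_mask, alpha=5.0):
    """Mean squared error with pose-region elements up-weighted by alpha.

    Assumption (not from the paper's code): elements inside the pose mask
    get weight `alpha`, all other elements get weight 1.
    """
    if not (len(pred) == len(target) == len(pose_mask)):
        raise ValueError("pred, target and pose_mask must have equal length")
    if not pred:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for p, t, m in zip(pred, target, pose_mask):
        weight = alpha if m else 1.0  # emphasize errors on pose elements
        total += weight * (p - t) ** 2
    return total / len(pred)


# Toy example: the first element lies on the pose, the second does not.
loss_plain = pose_weighted_mse([1.0, 2.0], [0.0, 0.0], [1, 0], alpha=1.0)
loss_pose = pose_weighted_mse([1.0, 2.0], [0.0, 0.0], [1, 0], alpha=3.0)
```

With `alpha=1.0` the function reduces to an ordinary mean squared error; larger values make errors on pose elements cost more than errors elsewhere.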

Paper Structure

This paper contains 18 sections, 7 equations, 14 figures, 17 tables, 1 algorithm.

Figures (14)

  • Figure 1: Stable-Pose leverages the patch-wise attention of ViTs to address the complex pose conditioning problem in T2I generation, showing superior performance compared to current techniques.
  • Figure 2: The Stable Diffusion architecture with Stable-Pose: operating on the pose skeleton image, Stable-Pose integrates a trainable ViT unit into the frozen-weight Stable Diffusion (Rombach et al., 2022) to improve the generation of pose-guided human images.
  • Figure 3: Stable-Pose consists of a pose encoder $\beta_\theta$ and a coarse-to-fine Pose-Masked Self-Attention (PMSA) ViT $\mathcal{F}_\theta$ for seeking the patch-wise relationship of human parts. PMSA restricts attention to embedding tokens within a specific attention mask to ensure that each embedding token can only attend to pose embedding tokens, not non-pose ones.
  • Figure 4: Qualitative results of SOTA techniques and our Stable-Pose on Human-Art (first two rows) and LAION-Human (last two rows). An illustration of the pose input is shown in Figure \ref{fig:illus_pose}.
  • Figure 5: Ablation on pose-mask guidance strength in the loss function.
  • ...and 9 more figures
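
The Figure 3 caption describes PMSA as restricting attention so that each embedding token attends only to pose embedding tokens. A minimal single-head sketch of that masking idea follows. It is an illustration under simplifying assumptions, not the paper's implementation: queries, keys, and values are the raw tokens (identity projections), and `pose_masked_attention` is a hypothetical name.

```python
import math


def pose_masked_attention(tokens, pose_mask):
    """Self-attention in which every query may attend only to pose tokens.

    tokens:    list of equal-length float vectors, one per patch.
    pose_mask: list of booleans, True where the patch overlaps the pose.
    Assumption: identity Q/K/V projections and a single head.
    """
    allowed = [j for j, is_pose in enumerate(pose_mask) if is_pose]
    if not allowed:
        raise ValueError("pose_mask must mark at least one pose token")

    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product scores against pose tokens only; non-pose
        # keys are excluded, which is equivalent to a -inf mask.
        scores = [scale * sum(a * b for a, b in zip(query, tokens[j]))
                  for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append([
            sum(w * tokens[j][d] for w, j in zip(weights, allowed))
            for d in range(dim)
        ])
    return outputs


# Toy example: three patches, only the first lies on the pose skeleton.
patches = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
attended = pose_masked_attention(patches, [True, False, False])
```

Because only the first patch is a pose token in this toy input, every output row equals that patch. A coarse-to-fine variant would presumably apply progressively tighter masks across layers, but the exact schedule is not specified in this section.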