Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR

This work tackles the difficulty of generating anatomically accurate hands with precise pose control in diffusion-based human image synthesis. It introduces a two-stage framework: a pose-conditioned hand generator that also outputs a hand segmentation mask, followed by a body outpainting stage around the hands using a skeleton-conditioned ControlNet, with a sequential mask-expansion blending strategy to ensure coherence. The approach achieves superior pose accuracy and image quality on the HaGRID dataset, with ablations validating the effectiveness of the blending method and the two-stage design. Overall, the method enhances hand realism and pose controllability in conditional human image generation, enabling more reliable and controllable synthesis for downstream applications; code is released.

Abstract

Recent years have seen significant progress in human image generation, particularly with advances in diffusion models. However, existing diffusion methods struggle to produce consistent hand anatomy, and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation that divides the process into two stages: hand generation and subsequent body outpainting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. To preserve the hand details during the second stage, we introduce a novel blending technique that combines the results of both stages coherently. It sequentially expands the outpainted region while fusing the latent representations, ensuring a seamless synthesis of the final image. Experimental evaluations on the HaGRID dataset demonstrate that the proposed method outperforms state-of-the-art techniques in both pose accuracy and image quality. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code of the proposed approach is available at https://github.com/apelykh/hand-to-diffusion.
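
The blending idea in the abstract (fusing latents while the outpainted region expands step by step) can be illustrated with a toy sketch. This is only one possible reading, not the authors' implementation: the abstract does not give the mask schedule, so the one-pixel-per-step erosion, the function names `blend_latents`, `erode`, and `outpaint_with_expansion`, and the `denoise_step` callable standing in for the stage-two model are all assumptions made for illustration. Noise scheduling of the hand latent is omitted entirely.

```python
from typing import Callable, List

Grid = List[List[float]]
Mask = List[List[int]]


def blend_latents(hand: Grid, proposal: Grid, keep: Mask) -> Grid:
    """Take the hand latent where keep == 1 and the proposal elsewhere."""
    return [
        [h if k else p for h, p, k in zip(h_row, p_row, k_row)]
        for h_row, p_row, k_row in zip(hand, proposal, keep)
    ]


def erode(mask: Mask) -> Mask:
    """Shrink a binary mask by one pixel using a 4-neighbourhood.

    Neighbours that fall outside the grid are ignored.
    """
    height, width = len(mask), len(mask[0])
    offsets = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))
    return [
        [
            int(all(
                mask[y + dy][x + dx]
                for dy, dx in offsets
                if 0 <= y + dy < height and 0 <= x + dx < width
            ))
            for x in range(width)
        ]
        for y in range(height)
    ]


def outpaint_with_expansion(
    hand: Grid,
    init: Grid,
    keep: Mask,
    denoise_step: Callable[[Grid, int], Grid],  # hypothetical stage-two step
    num_steps: int,
) -> Grid:
    """Run num_steps updates, re-imposing the hand latent inside `keep`.

    After every step the kept region shrinks by one pixel, so the
    outpainted region (the complement of `keep`) expands (assumed schedule).
    """
    latent = init
    for step in range(num_steps):
        proposal = denoise_step(latent, step)
        latent = blend_latents(hand, proposal, keep)
        keep = erode(keep)
    return latent
```

In this toy version the kept region eventually vanishes if enough steps are run; a real schedule would presumably stop shrinking at the hand mask itself, which is a design choice the sketch leaves open.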

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Examples of images generated by the proposed method (column 6) and by state-of-the-art diffusion models (columns 1 to 5), given the pose condition (final column) and a text description. The text prompts are provided in the supplementary material.
  • Figure 2: General overview of the proposed approach. We divide image generation into two sub-tasks: (I) hand generation (top part) and (II) body outpainting around the hands (bottom part).
  • Figure 3: Segmentation masks extracted with SAM (left), the masks after applying a dilation kernel (middle), and the pixel-wise difference between the two (right).
  • Figure 4: Visualization of the InceptionV3 convolutional features from the layer with feature dimension 192.
  • Figure 5: Examples of images from the HaGRID dataset with severe background clutter.
  • ...and 3 more figures
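
The mask processing described in the Figure 3 caption (dilating a segmentation mask and taking the pixel-wise difference) can be sketched in a few lines. This is a minimal illustration on a plain binary grid, not the paper's pipeline: the square kernel, the default `radius` of 1, and the helper names `dilate` and `pixel_difference` are assumptions, and the mask here is hand-written rather than produced by SAM.

```python
from typing import List

Mask = List[List[int]]


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Binary dilation with a (2*radius + 1) x (2*radius + 1) square kernel."""
    height, width = len(mask), len(mask[0])
    return [
        [
            int(any(
                mask[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ))
            for x in range(width)
        ]
        for y in range(height)
    ]


def pixel_difference(a: Mask, b: Mask) -> Mask:
    """Absolute per-pixel difference between two masks of equal size."""
    return [
        [abs(p - q) for p, q in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


# Toy 5x5 mask with a single foreground pixel in the centre.
mask = [[1 if (y, x) == (2, 2) else 0 for x in range(5)] for y in range(5)]
dilated = dilate(mask)
ring = pixel_difference(dilated, mask)  # pixels added by the dilation
```

The difference image contains exactly the band of pixels that dilation adds around the original mask, which is the kind of margin the right panel of Figure 3 visualizes.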