
GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang

TL;DR

This work tackles the challenge of generating 2D human pose skeletons directly from natural language, addressing limitations of GAN-based approaches in joint accuracy and skeletal proportions. It introduces PoseDiffusion, a diffusion-model framework in which each keypoint is represented as a heatmap and conditioned on text through cross-attention, with GUNet incorporating a graph convolutional spatial block to enforce skeletal structure. Experimental results on COCO show that PoseDiffusion achieves higher accuracy and greater diversity than GAN baselines and delivers more aesthetically coherent skeletons when used with ControlNet for image generation. The decoupled keypoint representation enables future extensions to multi-person pose generation and fine-grained body-part control via prompt-based masking.

Abstract

Pose skeleton images are an important reference in pose-controllable image generation. To enrich the source of skeleton images, recent works have investigated the generation of pose skeletons from natural language. These methods are based on GANs. However, it remains challenging to generate diverse, structurally correct and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desirable properties over existing methods. 1) Correct Skeletons. GUNet, the denoising model of PoseDiffusion, is designed to incorporate graph convolutional neural networks. It learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
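
To make the idea of "introducing skeletal information" through a graph convolution more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard COCO 17-keypoint skeleton as the graph and a Kipf-and-Welling-style normalized adjacency; the abstract does not specify either choice. The names `COCO_EDGES`, `normalized_adjacency`, and `spatial_block` are hypothetical, and plain Python lists stand in for tensors.

```python
import math

# Assumption: standard COCO 17-keypoint skeleton (0-indexed joint pairs).
COCO_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
    (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2), (0, 1), (0, 2),
    (1, 3), (2, 4), (3, 5), (4, 6),
]
NUM_JOINTS = 17


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(xv * yv for xv, yv in zip(row, col)) for col in zip(*y)]
            for row in x]


def spatial_block(features, weight, adj):
    """Graph convolution plus skip connection: H + (A_norm @ H @ W).

    features: one row of channels per joint (n x c).
    weight:   square channel-mixing matrix (c x c), so the skip sum is valid.
    """
    mixed = matmul(matmul(adj, features), weight)
    return [[f + m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mixed)]


# Usage: two feature channels per joint, identity weight.
adj = normalized_adjacency(COCO_EDGES, NUM_JOINTS)
features = [[float(j), 1.0] for j in range(NUM_JOINTS)]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = spatial_block(features, identity, adj)
```

Each joint's output mixes its own features with those of its skeletal neighbours, which is one way a denoiser could be nudged toward anatomically consistent keypoint layouts; a learned `weight` would replace the identity matrix used here.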

Paper Structure (20 sections, 8 equations, 7 figures, 2 tables)

This paper contains 20 sections, 8 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Illustrations of incorrect skeletons generated by GAN-based methods. Fig. 1(a) shows a disproportionate skeleton, Fig. 1(b) shows a skeleton with misplaced key points, and Fig. 1(c) shows a deformed and twisted skeleton.
  • Figure 2: Heatmap of a pose skeleton with 17 key points.
  • Figure 3: Pipeline overview. In the Diffusion Process, we transform the pose skeleton into a set of heatmaps via the Pose2Heatmap module and then add noise to them at each timestep. During training, the inputs to the Denoising Model are the noisy latent features from the Diffusion Process, the text embedding, and the timestep embedding, and the output is the predicted noise at the input timestep. The Reverse Process samples from noise to obtain heatmaps that match the input text and transforms them into the pose skeleton via the Heatmap2Pose module.
  • Figure 4: GUNet. GUNet is a U-Net-like structure consisting of three downsampling blocks, three upsampling blocks, a middle block and a spatial block. The sampling blocks and the middle block consist of CNN layers, a self-attention layer, and a cross-attention layer. The middle block is followed by a spatial block containing a graph convolutional neural network layer and a skip connection.
  • Figure 5: Some qualitative poses generated by the model, using the ground truth pose in the first line as a reference.
  • ...and 2 more figures
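
The Pose2Heatmap, noising, and Heatmap2Pose steps named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it assumes Gaussian heatmaps, argmax decoding, and a DDPM-style forward step parameterised by a cumulative `alpha_bar`, none of which are specified in the captions. The function names, the 32×32 grid, and `sigma` are hypothetical.

```python
import math
import random


def pose_to_heatmaps(keypoints, size=32, sigma=1.5):
    """Render one Gaussian heatmap (size x size) per (x, y) keypoint."""
    heatmaps = []
    for kx, ky in keypoints:
        hm = [[math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
               for x in range(size)]
              for y in range(size)]
        heatmaps.append(hm)
    return heatmaps


def heatmaps_to_pose(heatmaps):
    """Recover each keypoint as the argmax location of its heatmap."""
    pose = []
    for hm in heatmaps:
        best_y, best_x = max(
            ((y, x) for y in range(len(hm)) for x in range(len(hm[0]))),
            key=lambda yx: hm[yx[0]][yx[1]],
        )
        pose.append((best_x, best_y))
    return pose


def add_noise(heatmaps, alpha_bar, rng):
    """Assumed DDPM-style forward step:
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1).
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[[a * v + b * rng.gauss(0.0, 1.0) for v in row] for row in hm]
            for hm in heatmaps]


# Usage: integer keypoints round-trip exactly through the clean heatmaps.
pose = [(5, 7), (20, 12)]
heatmaps = pose_to_heatmaps(pose)
recovered = heatmaps_to_pose(heatmaps)
noisy = add_noise(heatmaps, alpha_bar=0.5, rng=random.Random(0))
```

Because each keypoint lives in its own channel, the channels can be noised and denoised independently of one another, which is consistent with the decoupled keypoint representation described in the abstract. A trained denoiser would sit between `add_noise` and `heatmaps_to_pose`; it is omitted here.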