HanDrawer: Leveraging Spatial Information to Render Realistic Hands Using a Conditional Diffusion Model in Single Stage

Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh

TL;DR

This work targets the persistent challenge of generating realistic hand gestures with diffusion models. It introduces HanDrawer, a spatially aware conditioning module that learns hand priors from MANO meshes and fuses them with diffusion features through cross-attention, augmented by a Position-Preserving Zero Padding fusion strategy. The authors curate a high-quality local-context HaGRID-based multimodal dataset (text, depth, and MANO-vertices) and train HanDrawer jointly with ControlNet in a one-stage gesture generation pipeline. Quantitative and qualitative results on HaGRID show state-of-the-art performance, demonstrating improved hand realism and pose accuracy, with potential benefits for avatars, VR, and assistive technologies; code and enhanced data will be released publicly if accepted.

Abstract

Although diffusion methods excel in text-to-image generation, generating accurate hand gestures remains a major challenge: outputs often show severe artifacts, such as an incorrect number of fingers or unnatural gestures. To enable the diffusion model to learn spatial information and improve the quality of the generated hands, we propose HanDrawer, a module that conditions the hand generation process. Specifically, we apply graph convolutional layers to extract the endogenous spatial structure and physical constraints implicit in MANO hand mesh vertices. We then align and fuse these spatial features with other modalities via cross-attention. The spatially fused features guide the denoising process of a single-stage diffusion model for high-quality generation of the hand region. To improve the accuracy of spatial feature fusion, we propose a Position-Preserving Zero Padding (PPZP) fusion strategy, which ensures that the features extracted by HanDrawer are fused into the region of interest in the relevant layers of the diffusion model. HanDrawer learns features of the entire image while paying special attention to the hand region, thanks to an additional hand reconstruction loss combined with the denoising loss. To accurately train and evaluate our approach, we carefully clean and relabel the widely used HaGRID hand gesture dataset to obtain high-quality multimodal data. Quantitative and qualitative analyses across multiple evaluation metrics demonstrate the state-of-the-art performance of our method on the HaGRID dataset. Source code and our enhanced dataset will be released publicly if the paper is accepted.
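The abstract does not spell out how PPZP works, so the following is only a minimal sketch of one plausible reading of the name: hand-region features are placed into an all-zero map of the full feature-map size at the hand's location, so that a later fusion step touches only the region of interest. The function names, the single-channel list-of-rows layout, the bounding-box offsets, and the additive fusion step are all assumptions made for illustration, not details taken from the paper.

```python
def position_preserving_zero_pad(patch, full_height, full_width, top, left):
    """Embed `patch` (a list of rows) into a zero map at offset (top, left).

    Hypothetical helper: every position outside the patch stays 0.0, so the
    patch keeps its spatial location relative to the full feature map.
    """
    patch_height = len(patch)
    patch_width = len(patch[0]) if patch else 0
    if (top < 0 or left < 0
            or top + patch_height > full_height
            or left + patch_width > full_width):
        raise ValueError("patch does not fit inside the full feature map")

    padded = [[0.0] * full_width for _ in range(full_height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            padded[top + r][left + c] = value
    return padded


def fuse_additively(full_map, padded_patch):
    """Element-wise sum of two equally sized maps (assumed fusion rule)."""
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(full_map, padded_patch)
    ]


# Toy example: a 2x2 hand-feature patch located at row 1, column 2 of a 4x4 map.
hand_features = [[1.0, 2.0], [3.0, 4.0]]
diffusion_features = [[0.5] * 4 for _ in range(4)]

padded = position_preserving_zero_pad(hand_features, 4, 4, top=1, left=2)
fused = fuse_additively(diffusion_features, padded)
```

In this toy case only the four cells covered by the patch change; every other cell of `fused` keeps its original value of 0.5, which is the position-preserving property the strategy's name suggests.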

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Accurate and realistic hand gesture generation. ControlNet [zhang2023adding] can guide Stable Diffusion to generate accurate body poses, but the precision and realism of the generated hand gestures are quite poor (first row). Our method (fourth row) improves the precision and realism of generated gestures by learning the complex intrinsic structure and spatial information of gestures, matching the ground truth pose (bottom row) more closely than other methods (second and third rows).
  • Figure 2: Low-quality samples in the HaGRID [kapitanov2024hagrid] dataset. The HaGRID dataset was constructed for gesture recognition tasks only, so each sample carries just a hand bounding box and a gesture label and lacks the annotations needed for generation tasks. Additionally, the dataset contains many samples with issues such as blurriness, occluded faces or hands, substandard gestures or incorrect annotations (compare with the standardised gestures in the lower right corner), as well as images that could cause automated annotators to hallucinate. Therefore, it is necessary to first clean the HaGRID dataset and then re-annotate it with the labels required for generation tasks.
  • Figure 3: HanDrawer training pipeline. HanDrawer and ControlNet are jointly trained using a reconstruction loss and a denoising loss.
  • Figure 4: HanDrawer inference pipeline. HanDrawer extracts intrinsic structural features of gestures and region localization from gesture labels, hand region depth maps, and MANO mesh parameters. These spatial features are then fused with appearance features and fed into specific layers of the diffusion model.
  • Figure 5: HanDrawer architecture. HanDrawer extracts features of three modalities through self-attention layers and fuses spatial features with these features via cross-attention layers. The gray cross-attention layers belong to the diffusion model. The spatially fused multimodal features extracted by HanDrawer enable the diffusion model to gain spatial understanding of the hand region. While extracting features, HanDrawer reconstructs the hand region, introducing a reconstruction loss to direct the model’s focus to the hand region.
  • ...and 2 more figures
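
The captions for Figures 3 and 5 state that HanDrawer and ControlNet are trained jointly with a reconstruction loss and a denoising loss. A minimal sketch of such a combined objective follows. The use of mean squared error for both terms, the flat-list representation of tensors, and the `reconstruction_weight` parameter are assumptions for illustration; the text above does not give the exact form or weighting of the losses.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between two equally long sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def joint_loss(predicted_noise, true_noise,
               reconstructed_hand, target_hand,
               reconstruction_weight=1.0):
    """Denoising loss plus a weighted hand-reconstruction loss.

    `reconstruction_weight` is a hypothetical hyperparameter; it is not
    specified in the text above.
    """
    denoising = mean_squared_error(predicted_noise, true_noise)
    reconstruction = mean_squared_error(reconstructed_hand, target_hand)
    return denoising + reconstruction_weight * reconstruction


# Toy values standing in for flattened tensors.
loss = joint_loss(
    predicted_noise=[0.0, 1.0], true_noise=[0.0, 0.0],
    reconstructed_hand=[1.0, 1.0], target_hand=[1.0, 3.0],
    reconstruction_weight=0.5,
)
```

Here the denoising term is 0.5 and the reconstruction term is 2.0, so with a weight of 0.5 the combined loss is 1.5. The reconstruction term is what directs the model's attention to the hand region, as the Figure 5 caption describes.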