RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

Chengrui Wang, Pengfei Liu, Min Zhou, Ming Zeng, Xubin Li, Tiezheng Ge, Bo Zheng

TL;DR

RHanDS addresses the instability of hand structures in diffusion-generated images by decoupling style guidance from structure guidance and training the two in separate stages. A conditional U-Net, operating in the latent space of a VAE, is guided by a CLIP-based style encoder and by a structure encoder that takes depth from a reconstructed hand mesh, which enables region-focused repainting of malformed hands. The authors introduce three multi-style datasets to support separate learning of style and structure, and report experiments in which RHanDS improves both structural accuracy (MPJPE) and style consistency (FID, Style Loss) across multiple styles, outperforming HandRefiner baselines. The approach offers a practical route to style-consistent hand refinement in diffusion-generated images.
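MPJPE, the structural metric named above, is conventionally the mean Euclidean distance between predicted and reference joint positions. The sketch below shows that standard definition only; the joint set, units, and any alignment step used in the paper's evaluation are not stated in this summary, and the function name and sample coordinates are illustrative assumptions.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between corresponding predicted and reference joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Two hypothetical 3D joints: one matches exactly, one is off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # 2.5
```

A lower value means the refined hand's joints lie closer to the reference structure.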

Abstract

Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. In this paper, we introduce RHanDS, a conditional diffusion-based framework designed to refine malformed hands by utilizing decoupled structure and style guidance. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand, while the malformed hand itself provides style guidance for preserving the style of the hand. To alleviate the mutual interference between style and structure guidance, we introduce a two-stage training strategy and build a series of multi-style hand datasets. In the first stage, we use paired hand images for training to ensure stylistic consistency in hand refining. In the second stage, various hand images generated based on human meshes are used for training, enabling the model to gain control over the hand structure. Experimental results demonstrate that RHanDS can effectively refine hand structure while preserving consistency in hand style.

Paper Structure (18 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Examples of hands refined by our RHanDS (right in each pair) from the malformed hands (left in each pair).
  • Figure 2: The proposed RHanDS framework contains four modules: a VAE that projects images into a latent space and reconstructs images from the latent; a conditional U-Net that predicts the denoised variant during the denoising process; a style encoder that extracts hand style from the malformed hand and maps it into the U-Net via cross-attention; and a structure encoder that uses the hand mesh reconstructed from the malformed hand to guide the hand structure. For a fully automatic process, a hand detection model and a 3D hand reconstruction model are also required.
  • Figure 3: Visual comparison of RHanDS with other methods on malformed hands in different styles. The malformed hands are generated from the original hands, and the structure guidance is reconstructed from the original hands.
  • Figure 4: The first stage. In this stage, the U-Net and style encoder are trained on the Multi-Style Paired Hand Dataset for style guidance.
  • Figure 5: The second stage. In this stage, the structure encoder is trained on the Multi-Style Hand-Mesh Dataset for structure guidance.
  • ...and 2 more figures
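
The two-stage schedule described in the captions for Figures 4 and 5 can be summarized as a small lookup. This is a minimal sketch, not the authors' training code: the module names follow the Figure 2 caption, the `stage_plan` helper is hypothetical, and treating every module not named in a stage's caption as frozen is an assumption.

```python
# Module names taken from the Figure 2 caption.
MODULES = ("vae", "unet", "style_encoder", "structure_encoder")

# Modules each stage trains, per the Figure 4 and Figure 5 captions.
TRAINABLE = {
    1: {"unet", "style_encoder"},
    2: {"structure_encoder"},
}

# Dataset each stage uses, per the same captions.
DATASET = {
    1: "Multi-Style Paired Hand Dataset",
    2: "Multi-Style Hand-Mesh Dataset",
}

def stage_plan(stage):
    """Return the dataset, trained modules, and (assumed) frozen modules for a stage."""
    if stage not in TRAINABLE:
        raise ValueError(f"unknown stage: {stage}")
    trainable = TRAINABLE[stage]
    return {
        "dataset": DATASET[stage],
        "trainable": sorted(trainable),
        # Assumption: anything a caption does not list as trained stays frozen.
        "frozen": sorted(m for m in MODULES if m not in trainable),
    }

for stage in (1, 2):
    print(stage, stage_plan(stage))
```

Separating the trained modules by stage mirrors the stated goal of reducing interference between style and structure guidance.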