MGHanD: Multi-modal Guidance for authentic Hand Diffusion

Taehyeon Eum, Jieun Choi, Tae-Kyun Kim

TL;DR

MGHanD addresses realistic hand generation in diffusion-based text-to-image (T2I) models by applying multi-modal guidance focused on the hand region during inference. It combines a Hand Discriminator for visual feedback with a LoRA-based textual adapter for prompt refinement, and restricts both to a Cumulative Hand Mask so that the background is preserved. The approach maintains the original model's style while producing anatomically plausible hands, validated through quantitative metrics, qualitative comparisons, and user studies. Because guidance is applied only at inference, hands are refined without fine-tuning the base model, with potential applicability to robotics and human-robot interaction, where accurate hand representations are critical.

Abstract

Diffusion-based methods have achieved significant success in text-to-image (T2I) generation, producing realistic images from text prompts. Despite these capabilities, the models still struggle to generate realistic human hands, often producing incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during inference. For visual guidance, we employ a discriminator trained on a dataset of paired real and generated images with captions, derived from various hand-in-the-wild datasets. For textual guidance, we employ a LoRA adapter that learns, at the latent level, the direction from 'hands' towards more detailed prompts such as 'natural hands' and 'anatomically correct fingers'. A cumulative hand mask, gradually enlarged over the assigned time steps, is applied to the added guidance, allowing the hand to be refined while maintaining the rich generative capabilities of the pre-trained model. In experiments, our method achieves superior hand generation quality without any specific conditions or priors. We carry out quantitative and qualitative evaluations, along with user studies, to demonstrate the benefits of our approach in producing high-quality hand images.
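
The cumulative hand mask can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the mask is accumulated as a pixel-wise union of per-step hand detections and that guidance is added only inside the mask. The function names `accumulate_mask` and `apply_masked_guidance`, the `scale` parameter, and the toy 4x4 detections are all hypothetical.

```python
import numpy as np

def accumulate_mask(cumulative, detected):
    """Grow the running mask with this step's detection.

    Assumption: accumulation is a pixel-wise union, so the mask can
    only stay the same or enlarge from one step to the next.
    """
    return np.logical_or(cumulative, detected)

def apply_masked_guidance(noise_pred, guidance, mask, scale=1.0):
    """Add the guidance term only where the mask is set (assumed form)."""
    return noise_pred + scale * mask.astype(noise_pred.dtype) * guidance

# Toy per-step hand detections on a 4x4 grid (hypothetical data).
step1 = np.zeros((4, 4), dtype=bool)
step1[1:3, 1:3] = True                   # small hand region
step2 = np.zeros((4, 4), dtype=bool)     # detector misses the hand
step3 = np.zeros((4, 4), dtype=bool)
step3[1:4, 1:4] = True                   # larger hand region

cumulative = np.zeros((4, 4), dtype=bool)
areas = []
for detected in (step1, step2, step3):
    cumulative = accumulate_mask(cumulative, detected)
    areas.append(int(cumulative.sum()))

# Guidance changes the prediction inside the mask and nowhere else.
noise_pred = np.zeros((4, 4))
guidance = np.ones((4, 4))
guided = apply_masked_guidance(noise_pred, guidance, cumulative, scale=0.5)
```

Under this union assumption, the missed detection at the second step leaves the mask unchanged instead of erasing it, and pixels outside the mask receive no guidance, which is one way the background could be kept intact.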

Paper Structure

This paper contains 14 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Comparison of Stable Diffusion (left) and our method, MGHanD (right), for text-to-image diffusion in hand-object interaction scenarios. While Stable Diffusion often produces malformed hands with anatomical inconsistencies or blurred details, MGHanD refines hand articulation and pose accuracy without compromising Stable Diffusion's overall visual style. Our method preserves the image composition and aesthetic while significantly improving the realism of hand-object interactions.
  • Figure 2: Overview of the proposed MGHanD framework. Left: the whole pipeline, showing the end-to-end process with a denoising U-Net $\epsilon_\theta$ under multi-modal guidance. The pipeline applies visual (Discriminator) and textual (LoRA) guidance weights to the model within cumulative hand region masks, allowing precise control over hand features while maintaining overall image coherence. Top right: the Discriminator Training module employs MSE logistic loss on real and fake hand images to enhance realism. Bottom right: LoRA Adapter Training uses a Stable Diffusion backbone for efficient fine-tuning to hand-specific prompts, employing L2 loss to optimize the adapter weights and ensure accurate text-to-hand alignment.
  • Figure 3: Visualization of the diffusion process's effect on the cumulative mask. By accumulating the mask over time steps, the method is robust to occasional errors in hand detection.
  • Figure 4: Qualitative Results: MGHanD vs. existing models. Comparison between our MGHanD (bottom row) and existing methods (SD, ConceptSlider, HandRefiner) for six common hand-object interaction tasks.
  • Figure 5: Ablation study on Multi-modal Guidance. Comparison of original images, MGHanD model, and versions without visual or textual guidance, highlighting the impact of each component.
  • ...and 2 more figures