HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

Supreeth Narasimhaswamy, Uttaran Bhattacharya, Xiang Chen, Ishita Dasgupta, Saayan Mitra, Minh Hoai

TL;DR

This work proposes a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process, and incorporates multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference.

Abstract

Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
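The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: both stages are random placeholders standing in for trained diffusion models, and every name here (`HandParams`, `text_to_hand_params`, `hand_params_to_image`, `generate`) and both vector lengths are hypothetical choices made for the sketch, not details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Placeholder vector lengths, assumed for illustration only.
BODY_DIM = 63
HAND_DIM = 45


@dataclass
class HandParams:
    """Hypothetical container for the stage-1 output (body and hand parameters)."""
    body_pose: List[float]
    left_hand_pose: List[float]
    right_hand_pose: List[float]


def text_to_hand_params(prompt: str, seed: int = 0) -> HandParams:
    """Stage 1 stand-in (Text-to-Hand-Params): maps a prompt to parameters.

    A real model would run a text-conditioned diffusion process; here we
    only draw deterministic pseudo-random values keyed on the prompt.
    """
    rng = random.Random(f"{seed}:{prompt}")
    return HandParams(
        body_pose=[rng.uniform(-1.0, 1.0) for _ in range(BODY_DIM)],
        left_hand_pose=[rng.uniform(-1.0, 1.0) for _ in range(HAND_DIM)],
        right_hand_pose=[rng.uniform(-1.0, 1.0) for _ in range(HAND_DIM)],
    )


def hand_params_to_image(
    prompt: str, params: HandParams, size: int = 8, seed: int = 0
) -> List[List[float]]:
    """Stage 2 stand-in (Text-Guided Hand-Params-to-Image).

    Conditions on BOTH the prompt and the stage-1 parameters, which is the
    point of the sketch; the "image" is just a size x size grid of floats.
    """
    cond = (
        sum(params.body_pose)
        + sum(params.left_hand_pose)
        + sum(params.right_hand_pose)
    )
    rng = random.Random(f"{seed}:{prompt}:{cond:.6f}")
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def generate(prompt: str, seed: int = 0) -> Tuple[HandParams, List[List[float]]]:
    """Chain the two stages: text -> hand parameters -> image."""
    params = text_to_hand_params(prompt, seed)
    image = hand_params_to_image(prompt, params, seed=seed)
    return params, image
```

Calling `generate("a person waving", seed=1)` returns the intermediate parameters alongside the image grid, which mirrors how the second component consumes both the prompt and the output of the first.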

Paper Structure

This paper contains 15 sections, 3 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Generating realistic hands. Text-to-Image generative models, e.g., Rombach et al. (CVPR 2022), often produce various hand artifacts (top row). We inject hand embeddings, capturing hand shapes, poses, and articulations, in the generation process to generate realistic hands (bottom row).
  • Figure 2: HanDiffuser architecture. Our architecture consists of two components. The first component, Text-to-Hand-Params (T2H), takes the text as input and generates body and hand parameters. The second component, Text-Guided Hand-Params-to-Image (T-H2I), uses the hand parameters from the first component and the text to generate images with high-quality hands. The Text+Hand encoder jointly encodes hand parameters and text, and captures hand pose, articulation, and shape.
  • Figure 3: Qualitative results. We compare the quality of hands in images generated by different methods from the same text prompts. (Images are generated at 512×512 resolution.)
  • Figure 4: Illustrative SMPL-H results, generated from our Text-to-Hand-Params model.
  • Figure 5: Generating images from text via SMPL-H. The intermediate SMPL-H representations are essential in generating realistic hand appearances.
  • ...and 7 more figures