
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance

Green Rosh, Prateek Kukreja, Vishakha SR, Pawan Prasad B H

Abstract

The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize well to 3D hand model generation, resulting in unnatural hand structures, view inconsistencies, and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view inconsistencies in SDS are primarily caused by ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all views of the 3D hand model converge to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over state-of-the-art methods, paving a new way forward in 3D hand model generation.
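For readers unfamiliar with the mechanism the abstract refers to, the following is a toy NumPy sketch of a single generic Score Distillation Sampling (SDS) gradient evaluation. It is not the paper's method (the corrective hand shape guidance loss is not shown); the `render` and `eps_pred` callables, the cosine noise schedule, and the weighting `w` are all illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_grad(render, eps_pred, theta, t, w):
    """One generic SDS gradient evaluation (toy sketch).

    render(theta)     -> rendered image x for scene parameters theta
    eps_pred(x_t, t)  -> diffusion model's predicted noise for noised image x_t
    w(t)              -> timestep-dependent weighting
    For simplicity, render is assumed to be the identity here, so the
    Jacobian dx/dtheta reduces to the identity as well.
    """
    x = render(theta)
    alpha = np.cos(t * np.pi / 2) ** 2          # toy noise schedule (placeholder)
    eps = rng.standard_normal(x.shape)          # sampled Gaussian noise
    x_t = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * eps  # forward diffusion
    # SDS gradient: w(t) * (predicted noise - sampled noise),
    # propagated through dx/dtheta (identity in this sketch)
    return w(t) * (eps_pred(x_t, t) - eps)
```

The abstract's mode-ambiguity argument corresponds to `eps_pred` pulling the same view toward different modes across noise samples and timesteps; the proposed MANO initialization and skeleton guidance constrain which mode each view converges to.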

Paper Structure

This paper contains 23 sections, 20 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: We propose HandDreamer: the first method for zero-shot 3D hand generation from text prompts. Our method generates high-fidelity, geometrically accurate 3D hand models with diverse articulations from text prompts. Existing methods generate Janus artifacts (HiFA, ESD) and fewer details (OHTA, DreamDPO, HumanNorm) (g). Surface maps provided inset.
  • Figure 2: Convergence into wrong modes. (a) Probable modes for the same viewpoint. (b) Random initialization can converge to different modes for the same viewpoint, leading to Janus artifacts. (c,d) Visualization of gradients for the same viewpoint at multiple timesteps ($t$). Random initialization leads to less informative gradients at lower $t$ and diverse gradients causing view inconsistencies at higher $t$. MANO initialization yields consistent gradients at both lower and higher $t$.
  • Figure 3: Overview of HandDreamer. Our method generates 3D hand models from text prompts in 2 stages: (a) Hand shape initialization using MANO mesh; (b) Hand model generation using skeleton and Corrective Hand Shape (CHS) guidance loss.
  • Figure 4: Our method generates 3D hand models with detailed texture and view-consistent geometry. Surface maps provided inset.
  • Figure 5: Comparison against state-of-the-art text-to-3D methods. Janus artifacts and inconsistent fingers are marked with red arrows and circles (a-c). Text-to-human methods (d-f) generate hands with far fewer details. Our method generates better 3D hand models with consistent geometry and details.
  • ...and 16 more figures