FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar

TL;DR

FoundHand addresses the challenge of realistic hand generation by building a large-scale, domain-specific diffusion model trained on FoundHand-10M, a dataset of 10M hand images annotated with 2D keypoints and segmentation masks. It treats generation as a two-frame image-to-image diffusion task conditioned on 2D keypoint heatmaps, enabling precise pose and camera-view control without full 3D supervision. Key contributions include the FoundHand-10M dataset, a 2D keypoint-conditioned latent diffusion model with multi-modal alignment, and core capabilities such as gesture transfer, domain transfer, novel view synthesis, plus zero-shot hand fixing and hand-object video synthesis. The approach demonstrates state-of-the-art performance and strong generalization to in-the-wild scenarios, with practical impact on hand-centric graphics, AR/VR avatars, and robotics.

Abstract

Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.
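
To make the keypoint representation concrete, the sketch below renders a set of 2D keypoints as per-keypoint Gaussian heatmaps, a common way to feed keypoints to an image model. It is a minimal illustration only, not the authors' implementation: the function name `keypoints_to_heatmaps`, the Gaussian form, the `sigma` value, and the 32x32 resolution are all assumptions that do not come from the paper.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D keypoint (hypothetical helper).

    keypoints: array of shape (K, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (K, height, width) with values in (0, 1].
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]  # pixel grids, each (height, width)
    dx = xs[None, :, :] - keypoints[:, 0, None, None]
    dy = ys[None, :, :] - keypoints[:, 1, None, None]
    return np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2))

# Two made-up keypoints on a 32x32 grid (sigma and size are assumptions).
kps = np.array([[8.0, 4.0], [20.0, 25.0]])
heatmaps = keypoints_to_heatmaps(kps, height=32, width=32)
```

Because each heatmap only records where a joint lands in the image, the same array changes when either the hand articulates or the camera moves, which is the sense in which 2D keypoints encode both articulation and viewpoint.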

Paper Structure

This paper contains 26 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: We present FoundHand, a domain-specific image generation model that can synthesize realistic single and dual hand images. FoundHand is trained on our large-scale FoundHand-10M dataset which contains automatically extracted 2D keypoints and segmentation mask annotations (top left). FoundHand is formulated as a 2D pose-conditioned image-to-image diffusion model that enables precise hand pose and camera viewpoint control (top right). Optionally, we can condition the generation with a reference image to preserve its style (top right). Our model demonstrates exceptional in-the-wild generalization across hand-centric applications and has core capabilities such as gesture transfer, domain transfer, and novel view synthesis (middle row). This endows FoundHand with zero-shot applications to fix malformed hand images and synthesize coherent hand and hand-object videos, without explicitly giving object cues (bottom row).
  • Figure 2: (Left) During training, we randomly sample two frames from a video sequence, or two different views of the same frame, as the reference and target frames and encode them with a pretrained VAE, as in latent diffusion models. We concatenate the encoded image features $z$ with the keypoint heatmaps $\mathcal{H}$ and hand mask $\mathcal{M}$, then encode them with a shared-weight embedder to obtain spatially aligned feature patches before feeding them to a transformer with 3D self-attention. The target hand mask is set to $\emptyset$ since it is not required at test time. $y$ indicates whether the two frames come from synchronized views.
  • Figure 3: Given reference images (top), our model transforms hands to target poses (visualized alongside our results at the bottom) while faithfully preserving appearance details such as fingernails and textures. The model demonstrates generalization across diverse visual domains, from photorealistic images to artistic paintings, maintaining both anatomical plausibility and appearance fidelity.
  • Figure 4: Given a synthetic hand dataset, FoundHand can transform it to the in-the-wild domain with realistic appearance and background, improving existing 3D hand estimation after finetuning on our generated data.
  • Figure 5: From a single input image (1st column), FoundHand generates diverse viewpoints, demonstrating robust generalization to unseen hands and camera poses.
  • ...and 3 more figures
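
The input assembly described in the Figure 2 caption can be sketched roughly as follows. This is a shape-level illustration under stated assumptions, not the paper's code: the channel counts, latent size, patch size, embedding width, and the helper names `patchify` and `frame_tokens` are all hypothetical, random arrays stand in for real VAE latents and heatmaps, a single linear projection stands in for the shared-weight embedder, and a zero array stands in for the empty target mask. How the flag $y$ enters the network is not stated in the caption, so the sketch only defines it.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
C_LATENT, C_HEAT, SIZE, PATCH, DIM = 4, 42, 32, 2, 64

def patchify(x, patch):
    """Split a (C, H, W) array into (num_patches, C * patch * patch) rows."""
    c, h, w = x.shape
    x = x.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

def frame_tokens(z, heatmaps, mask, weight):
    """Concatenate latent, heatmaps and mask channel-wise, then embed patches."""
    stacked = np.concatenate([z, heatmaps, mask], axis=0)
    return patchify(stacked, PATCH) @ weight

in_dim = (C_LATENT + C_HEAT + 1) * PATCH * PATCH
weight = rng.normal(size=(in_dim, DIM)) * 0.02  # one embedder shared by both frames

z_ref = rng.normal(size=(C_LATENT, SIZE, SIZE))  # stand-in reference latent
z_tgt = rng.normal(size=(C_LATENT, SIZE, SIZE))  # stand-in (noised) target latent
h_ref = rng.random(size=(C_HEAT, SIZE, SIZE))    # stand-in keypoint heatmaps
h_tgt = rng.random(size=(C_HEAT, SIZE, SIZE))
m_ref = (rng.random(size=(1, SIZE, SIZE)) > 0.5).astype(np.float64)
m_tgt = np.zeros((1, SIZE, SIZE))  # stand-in for the empty target mask

ref_tokens = frame_tokens(z_ref, h_ref, m_ref, weight)
tgt_tokens = frame_tokens(z_tgt, h_tgt, m_tgt, weight)

# Tokens from both frames form one sequence, so attention can span
# space and the frame axis jointly.
tokens = np.concatenate([ref_tokens, tgt_tokens], axis=0)
y = 1  # hypothetical encoding: 1 if the frames are synchronized views, else 0
```

Using the same projection for both frames keeps reference and target patches in one feature space, so a token at a given spatial position carries image, pose, and mask information for that location in either frame.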