Annotated Hands for Generative Models
Yue Yang, Atith N Gandhi, Greg Turk
TL;DR
Generative models struggle to render hands reliably. The paper introduces a model-agnostic training augmentation that adds three hand-specific annotation channels to the RGB training images, guiding both a GAN (StyleGAN2) and a diffusion model to learn hand structure. It builds synthetic (MANO-based) and real (Onehand10k, Mediapipe) hand datasets and evaluates with Mediapipe-based metrics, including Mediapipe confidence and Mean Joint Ratio Difference. Results show that training on six-channel images (RGB plus the three annotation channels) improves hand realism for both models, with notable gains in detection confidence and joint-structure fidelity, although the effect on FID depends on the dataset. Limitations include dataset size and compute. The approach may extend to other objects with many degrees of freedom and could be integrated with large diffusion models.
Abstract
Generative models such as GANs and diffusion models have demonstrated impressive image generation capabilities. Despite these successes, these systems are surprisingly poor at creating images with hands. We propose a novel training framework for generative models that substantially improves the ability of such systems to create hand images. Our approach is to augment the training images with three additional channels that provide annotations of the hands in the image. These annotations provide additional structure that coaxes the generative model to produce higher-quality hand images. We demonstrate this approach on two different generative models: a generative adversarial network and a diffusion model. We demonstrate our method both on a new synthetic dataset of hand images and on real photographs that contain hands. We measure the improved quality of the generated hands through higher confidence in finger joint identification using an off-the-shelf hand detector.
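The channel augmentation described above can be illustrated with a minimal sketch. It assumes images are stored as NumPy arrays of shape (H, W, 3) and that the three annotation channels are already rendered as an image of the same size; the abstract does not say what the annotation channels encode, so the annotation here is a placeholder. The function names `stack_annotations` and `split_annotations` are hypothetical and are not taken from the paper.

```python
import numpy as np


def stack_annotations(rgb: np.ndarray, annotation: np.ndarray) -> np.ndarray:
    """Concatenate an RGB image and a 3-channel annotation image channel-wise.

    Both inputs are assumed to have shape (H, W, 3); the result is (H, W, 6).
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if annotation.shape != rgb.shape:
        raise ValueError("annotation must have the same shape as rgb")
    return np.concatenate([rgb, annotation], axis=2)


def split_annotations(six_channel: np.ndarray):
    """Split a six-channel sample back into its RGB and annotation parts."""
    if six_channel.ndim != 3 or six_channel.shape[2] != 6:
        raise ValueError("six_channel must have shape (H, W, 6)")
    return six_channel[..., :3], six_channel[..., 3:]


# Toy data: an all-black RGB image and a placeholder annotation image.
rgb = np.zeros((8, 8, 3), dtype=np.float32)
annotation = np.ones((8, 8, 3), dtype=np.float32)

sample = stack_annotations(rgb, annotation)         # training-time input
rgb_out, annotation_out = split_annotations(sample)  # after generation
```

Under this assumption, a generator trained on such arrays emits six channels per sample, and only the first three would be kept as the final RGB image.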
