Annotated Hands for Generative Models

Yue Yang, Atith N Gandhi, Greg Turk

TL;DR

Generative models struggle to render hands reliably. The paper introduces a model-agnostic training augmentation that adds three hand-specific annotation channels to RGB images, guiding both a GAN (StyleGAN2) and a diffusion model to learn hand structure. It builds a synthetic (MANO-based) hand dataset and a real one (Onehand10k photographs annotated with Mediapipe), and evaluates with Mediapipe-based metrics, including Mediapipe confidence and Mean Joint Ratio Difference. Results show that training on six-channel inputs (RGB plus the three annotation channels) improves hand realism for both models, with notable gains in detection confidence and joint-structure fidelity, though FID trade-offs depend on the dataset; limitations include dataset size and compute. The approach may extend to other high-DOF objects and could be integrated with large diffusion models.

Abstract

Generative models such as GANs and diffusion models have demonstrated impressive image generation capabilities. Despite these successes, these systems are surprisingly poor at creating images with hands. We propose a novel training framework for generative models that substantially improves the ability of such systems to create hand images. Our approach is to augment the training images with three additional channels that annotate the hands in the image. These annotations provide additional structure that coaxes the generative model to produce higher quality hand images. We demonstrate this approach on two different generative models: a generative adversarial network and a diffusion model. We demonstrate our method both on a new synthetic dataset of hand images and on real photographs that contain hands. We measure the improved quality of the generated hands through higher confidence in finger joint identification using an off-the-shelf hand detector.
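
The channel augmentation described above can be sketched as follows. This is a minimal illustration, assuming NumPy arrays in height-by-width-by-channel layout. The function names `stack_annotations` and `split_generated` are hypothetical, the annotation contents here are placeholders, and treating the first three channels of a generated sample as the output image is an assumption rather than something the abstract states.

```python
import numpy as np


def stack_annotations(rgb: np.ndarray, annotations: np.ndarray) -> np.ndarray:
    """Concatenate an RGB image with three annotation channels into (H, W, 6)."""
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if annotations.shape != rgb.shape:
        raise ValueError("annotations must have the same (H, W, 3) shape as rgb")
    return np.concatenate([rgb, annotations], axis=2)


def split_generated(sample: np.ndarray):
    """Split a six-channel sample into its RGB part and its annotation part.

    Assumption: channels 0-2 are RGB and channels 3-5 are the annotations.
    """
    return sample[..., :3], sample[..., 3:]


# Placeholder data: a black image and constant-valued annotation channels.
rgb = np.zeros((64, 64, 3), dtype=np.float32)
annotations = np.ones((64, 64, 3), dtype=np.float32)

six_channel = stack_annotations(rgb, annotations)
image, hints = split_generated(six_channel)
```

A generative model trained on such arrays only needs its input and output channel count changed from three to six, which may be one reason the approach can be applied to different architectures.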

Paper Structure (26 sections, 2 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Hand Skeleton.
  • Figure 2: First Annotation Channel.
  • Figure 3: Second Annotation Channel.
  • Figure 4: (a) Left hand and its annotation channels. (b) Right hand and its annotation channels.
  • Figure 5: Pipeline for generating annotated hands. To create synthetic hand images, we pass hand size and pose information to the MANO hand model. We then render an image from the model and compose it in front of a background. For real hands, we utilize the Onehand10k dataset and rely on Mediapipe to generate annotations.
  • ...and 9 more figures
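
The compositing step mentioned in the Figure 5 caption (a rendered hand placed in front of a background) can be illustrated with standard alpha blending. This is only a sketch, assuming the renderer yields an RGBA image with values in [0, 1]. The function name `composite_over` is hypothetical, and the paper's actual rendering pipeline is not reproduced here.

```python
import numpy as np


def composite_over(hand_rgba: np.ndarray, background_rgb: np.ndarray) -> np.ndarray:
    """Alpha-blend a rendered RGBA foreground over an RGB background."""
    alpha = hand_rgba[..., 3:4]  # keep the last axis so it broadcasts over RGB
    return alpha * hand_rgba[..., :3] + (1.0 - alpha) * background_rgb


# Toy 3x3 example: one opaque red "hand" pixel over a solid blue background.
hand = np.zeros((3, 3, 4), dtype=np.float32)
hand[1, 1] = [1.0, 0.0, 0.0, 1.0]

background = np.zeros((3, 3, 3), dtype=np.float32)
background[..., 2] = 1.0

composited = composite_over(hand, background)
```

Pixels where alpha is zero keep the background color, and fully opaque pixels take the rendered hand color.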