Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

Haozhuo Zhang; Bin Zhu; Yu Cao; Yanbin Hao

TL;DR

This paper tackles the difficulty of generating anatomically accurate hands in text-to-image diffusion, where existing models often produce distorted hands. It introduces Hand1000, a three-stage training pipeline that uses Mediapipe gesture features and embedding optimization to inject hand priors and gesture alignment into Stable Diffusion, using only 1,000 images per target gesture. A new HaGRID-based dataset, with captions generated by BLIP2, PaliGemma, and VitGpt2 and enriched with gesture information by LLaMA3, enables targeted hand generation. Hand1000 shows substantial improvements over baseline diffusion methods on hand-related metrics while preserving non-hand details. The approach combines gesture-feature extraction, double embedding fusion optimized through a frozen Stable Diffusion model, and diffusion fine-tuning with the optimized embedding held fixed, offering a practical, data-efficient path to reliable hand generation in text-to-image tasks.

Abstract

Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle to generate anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty of aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with a target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages. The first stage aims to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract a gesture representation. The second stage further optimizes the text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage uses the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Starting from an existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.

Paper Structure (22 sections, 3 equations, 7 figures, 3 tables)

Introduction
Related Work
Text-to-Image Generation
Realistic Hand Generation
Text-based Image Editing
Method
Preliminaries
Stage I: Hand Gesture Feature Extraction
Stage II: Text Embedding Optimization
Stage III: Stable Diffusion Fine-tuning
Inference
Experiments
Dataset Construction
Evaluation Metrics
Implementation Details
...and 7 more sections

Figures (7)

Figure 1: Comparison of hand image generation results between Stable Diffusion and our Hand1000. Given the same text prompt, Stable Diffusion produces deformed and chaotic hands. In contrast, our proposed Hand1000 manages to generate anatomically correct and realistic hands while preserving details such as character, clothing, and colors.
Figure 2: Hand1000 is trained in three stages. Stage I computes the mean hand gesture feature from the training images. Stage II concatenates this mean feature with the corresponding text embeddings, maps the concatenated features into a fused embedding, and then linearly fuses that with the original text embedding to obtain a double-fused embedding. The double-fused embedding is optimized with a reconstruction loss through a frozen Stable Diffusion model. Stage III fine-tunes the Stable Diffusion model for image generation while keeping the optimized embedding from Stage II frozen.
Figure 3: Overview of the inference phase. First, a text embedding is computed from the input textual description. Next, the stored mean hand gesture feature is concatenated and fused with the text embedding to obtain the double-fused embedding. Finally, the double-fused embedding is fed into the diffusion model trained in Stage III to generate hand images.
Figure 4: Dataset construction begins by generating a textual description from an image with an image captioning model (e.g., BLIP2). The description, along with the gesture label, is then fed into the LLaMA3 model (touvron2023open) to produce a text description enriched with gesture label information.
Figure 5: Comparison of images of the "four fingers up" gesture generated by Stable Diffusion, fine-tuned Stable Diffusion, Stable Diffusion enhanced with Imagic (kawar2023imagic), fine-tuned Stable Diffusion enhanced with Imagic (kawar2023imagic), Stable Diffusion enhanced with HandRefiner (lu2023handrefiner), and our Hand1000.
...and 2 more figures
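
The Figure 2 caption describes two operations that are easy to show concretely: averaging gesture features (Stage I) and building the double-fused embedding (Stage II). The following is a minimal sketch in plain Python, not the authors' implementation. Several details are assumptions made for illustration only: the mapping is modeled as a single linear layer (`weight`, `bias`), the gesture feature is concatenated to every text token, and the linear fusion uses a scalar weight named `alpha`. All function and parameter names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_gesture_feature(features: Sequence[Sequence[float]]) -> Vector:
    """Stage I (sketch): average per-image gesture features into one vector."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(dim)]


def linear_map(x: Sequence[float],
               weight: Sequence[Sequence[float]],
               bias: Sequence[float]) -> Vector:
    """Compute W x + b, where `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def double_fused_embedding(text_tokens: Sequence[Sequence[float]],
                           gesture: Sequence[float],
                           weight: Sequence[Sequence[float]],
                           bias: Sequence[float],
                           alpha: float) -> List[Vector]:
    """Stage II (sketch): concatenate, map, then linearly fuse with the text.

    Assumption: `alpha` weights the fused embedding and (1 - alpha) weights
    the original text embedding; the source does not specify this form.
    """
    result = []
    for token in text_tokens:
        concatenated = list(token) + list(gesture)      # first fusion input
        fused = linear_map(concatenated, weight, bias)  # map back to text dim
        result.append([alpha * f + (1.0 - alpha) * t    # second (linear) fusion
                       for f, t in zip(fused, token)])
    return result
```

In this sketch the mapping must project the concatenated vector back to the text-embedding dimension so that the second fusion is an element-wise blend. At inference time (Figure 3), the same function would be applied to a new prompt's text embedding using the stored mean gesture feature.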
