AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

Junho Park, Kyeongbo Kong, Suk-Ju Kang

TL;DR

AttentionHand tackles the problem of 3D hand mesh reconstruction in the wild by enabling text-driven controllable hand image generation that is aligned with 3D hand labels. It introduces a diffusion-based framework with a dual-stage conditioning pipeline: a Text Attention Stage (TAS) that focuses on hand-related tokens and a Visual Attention Stage (VAS) that fuses global and local hand mesh cues. The method leverages latent embeddings from four inputs (global RGB, global hand mesh, bounding box, and text prompt) and demonstrates state-of-the-art performance in text-to-hand image generation while substantially improving 3D hand reconstruction when used to augment training data. This approach reduces domain gaps between indoor and outdoor scenes and provides a scalable means to generate diverse, accurately annotated in-the-wild hand images for downstream perception tasks.

Abstract

Recently, a significant amount of research has been conducted on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to an extreme lack of in-the-wild 3D hand datasets. In particular, when hands are in complex poses such as interacting hands, problems like appearance similarity, self-handed occlusion, and depth ambiguity make it more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate a large number of diverse in-the-wild hand images that are well-aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method needs four easy-to-use modalities (i.e., an RGB image, a hand mesh image from a 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to in order to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended to by conditioning on global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieved state-of-the-art performance among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction was improved by additionally training with hand images generated by AttentionHand.
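The idea of attending to hand-related tokens can be illustrated with a small sketch. This is not the authors' implementation: the vocabulary `HAND_WORDS`, the whitespace tokenizer, the helper names, and the loss form (one minus the weakest attention peak among hand tokens, in the spirit of attention-guidance methods) are all assumptions made for illustration. The abstract does not specify how tokens are tagged or how the loss is defined.

```python
# Hypothetical hand-related vocabulary (an assumption, not from the paper).
HAND_WORDS = {"hand", "hands", "finger", "fingers", "palm", "palms", "fist", "thumb"}


def tag_hand_tokens(prompt):
    """Split a prompt on whitespace and return (tokens, indices of hand-related tokens)."""
    cleaned = prompt.lower().replace(",", " ").replace(".", " ")
    tokens = cleaned.split()
    hand_idx = [i for i, tok in enumerate(tokens) if tok in HAND_WORDS]
    return tokens, hand_idx


def hand_attention_loss(attn_maps, hand_idx):
    """Assumed loss: 1 - the smallest peak attention among hand tokens.

    attn_maps[i] is a flat list of attention weights (one per spatial
    location) for token i. A low peak for any hand token gives a high loss.
    """
    if not hand_idx:
        return 0.0
    peaks = [max(attn_maps[i]) for i in hand_idx]
    return 1.0 - min(peaks)


# Usage: 7 tokens, 4 spatial locations each (toy numbers).
tokens, hand_idx = tag_hand_tokens("A man waving his hands, fingers spread")
attn_maps = [[0.25, 0.25, 0.25, 0.25] for _ in tokens]
attn_maps[4] = [0.1, 0.6, 0.2, 0.1]    # "hands" has a clear peak
attn_maps[5] = [0.25, 0.25, 0.3, 0.2]  # "fingers" is weakly attended
loss = hand_attention_loss(attn_maps, hand_idx)
print(tokens, hand_idx, loss)
```

In this toy setup the weakly attended token dominates the loss, so minimizing it would push the embedding toward states in which every hand-related token has a strongly attended region.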

Paper Structure

This paper contains 45 sections, 13 equations, 20 figures, and 5 tables.

Figures (20)

  • Figure 1: Various acquisition types of 3D hand datasets. (a) An in-the-wild dataset (i.e., MSCOCO [lin2014microsoft]) is naively acquired with inaccurate pseudo annotations, (b) a relighted dataset (i.e., Re:InterHand [moon2023dataset]) consists of unnatural hands with inharmonious backgrounds, and (c) our in-the-wild dataset from AttentionHand is annotated with accurate 3D labels, contains natural hands with harmonious backgrounds, is easy to generate, and can be extended without limit.
  • Figure 2: Visualization of attention maps with corresponding tokens from given text prompts. Red and green boxes represent attention maps without and with AttentionHand, respectively.
  • Figure 3: Overall pipeline of AttentionHand. In the data preparation phase, we prepare global and local RGB images, global and local hand mesh images, a bounding box, and a text prompt. In the encoding phase, we obtain global and local latent image embeddings through VQ-GAN [esser2021taming], and a text embedding through CLIP [radford2021learning]. In the conditioning phase, we refine the image embeddings through the text attention stage, and obtain the diffusion feature through the visual attention stage. In the decoding phase, we generate a new hand image $\hat{I}_{RGB}$ from $Y_d$ through VQ-GAN.
  • Figure 4: Overall process of the text attention stage (TAS). By leveraging hand-related tagging and refinement, we can highlight hand-related attention maps, which leads to updating the noisy embeddings with $\mathcal{L}^{TAS}$.
  • Figure 5: Overall process of the visual attention stage (VAS). By utilizing global and local information, we can obtain a harmonious diffusion feature, which leads to generating high-fidelity hand images.
  • ...and 15 more figures
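
The Figure 3 caption mentions preparing both global and local images together with a bounding box. A minimal sketch of how a local view could be cropped from a global one is given below. It assumes the bounding box is given as pixel coordinates `(x_min, y_min, x_max, y_max)` with exclusive maxima and that images are nested lists; the function name `crop_local` and this box convention are illustrative assumptions, not details taken from the paper.

```python
def crop_local(image, bbox):
    """Crop a local region from a global image.

    image: H x W nested list (each entry may be a pixel value or an RGB tuple).
    bbox:  (x_min, y_min, x_max, y_max) in pixels, maxima exclusive (assumed convention).
    The box is clipped to the image bounds before cropping.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = bbox
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("bounding box does not overlap the image")
    return [row[x0:x1] for row in image[y0:y1]]


# Usage: a 4 x 5 toy "image" whose pixel value encodes (row, column).
global_rgb = [[r * 10 + c for c in range(5)] for r in range(4)]
global_mesh = [[1 if 1 <= c < 4 and 1 <= r < 3 else 0 for c in range(5)] for r in range(4)]
bbox = (1, 1, 4, 3)

# Applying the same box to both inputs keeps the local pair pixel-aligned.
local_rgb = crop_local(global_rgb, bbox)
local_mesh = crop_local(global_mesh, bbox)
print(local_rgb)   # [[11, 12, 13], [21, 22, 23]]
print(local_mesh)  # [[1, 1, 1], [1, 1, 1]]
```

Using one box for the RGB image and the rendered hand mesh image is the design point of the sketch: the local pair then covers the same pixels, which is what allows a mesh image to serve as a spatial condition for the corresponding RGB region.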