HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang

TL;DR

HOIDiffusion is a conditional diffusion model that takes both the 3D hand-object geometric structure and a text description as inputs for image synthesis. Because structure and style are specified in a disentangled manner, the synthesis is more controllable and realistic.

Abstract

3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis, we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: https://mq-zhang1.github.io/HOIDiffusion
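
The disentangled control described above can be illustrated with a minimal sketch. It only enumerates (structure, text) pairs so that each row shares one geometry and each column shares one style; it does not run any diffusion model. The names `GenerationRequest`, `build_request_grid`, and the example identifiers and prompts are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    structure_id: str  # hypothetical handle for one 3D hand-object geometry
    prompt: str        # text description that controls appearance and style


def build_request_grid(structure_ids, prompts):
    """Pair every structure with every prompt.

    Each row fixes the structure and varies the text;
    each column fixes the text and varies the structure.
    """
    return [[GenerationRequest(s, p) for p in prompts] for s in structure_ids]


# Illustrative inputs only (assumed, not taken from the paper).
structures = ["grasp_mug_01", "grasp_bottle_02"]
prompts = [
    "a hand holding an object in a kitchen",
    "a hand holding an object on a beach",
]

grid = build_request_grid(structures, prompts)
for row in grid:
    print([(r.structure_id, r.prompt) for r in row])
```

Under these assumptions, each `GenerationRequest` would be handed to a conditional generator that consumes the two inputs separately, which is what allows one of them to change while the other stays fixed.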

Paper Structure (18 sections, 3 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: (i) Left: Hand-object synthesis with Stable Diffusion model; (ii) Right: HOIDiffusion generates high-quality hand-object interaction images conditioned on physical structures and detailed text description. The model disentangles the geometry from appearance, exhibiting high generation diversity. Each row: We can fix the structure and control the style based on text inputs; Each column: We can fix the style and control the structure based on 3D structural inputs.
  • Figure 2: Pipeline. We propose a two-stage pipeline to synthesize hand-object interaction data. In the first stage, we use a pretrained GrabNet to output 3D hand poses given a single object model. In the second stage, we use those 3D hand poses along with segmentation maps, normal maps, and skeletons to conditionally generate high-quality HOI data.
  • Figure 3: Model Figure. We inject three conditional encoders into the Stable Diffusion model. We train HOIDiffusion on both the HOI datasets and high-quality background images. The background images are synthesized from scenery prompts. The text prompts fed to the model are detailed descriptions produced by LLaVA.
  • Figure 4: Qualitative results on different structures. Generated images with the same background description but different physical conditions (object shape, poses, and hand skeletons). With plain prompts, HOIDiffusion can generate more realistic images whose style resembles that of the training datasets.
  • Figure 5: Synthesized images with diverse background descriptions. In addition to real-style synthesis, our model also lets users generate images in styles of their choosing, such as science fiction or general landscapes.
  • ...and 5 more figures