Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

Agniv Sharma, Xianghui Xie, Tom Fischer, Eddy Ilg, Gerard Pons-Moll

Abstract

Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.

Paper Structure

This paper contains 34 sections, 4 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Given detailed text descriptions of a human, an object, and their interaction, Hoi3DGen generates high-quality textured human and object meshes that precisely follow the contact semantics, together with an aligned, animatable SMPL model.
  • Figure 2: HOI3D framework overview. Top: We first leverage the existing multimodal foundation model InternVL [chen2024internvl2] to perform decomposed annotation of the human, the object, and the human-object interaction for samples from the ProciGen [xie2023template_free] dataset. We then use LLaMa [grattafiori2024llama3] to create a final detailed caption for each sample. Bottom: We leverage our data, consisting of high-quality and diverse human-object interactions, to fine-tune an existing text-to-image model. Subsequently, we establish a pipeline to reconstruct high-fidelity textured 3D meshes. The output of our final text-to-3D inference pipeline consists of segmented meshes for the human and object, as well as an animatable SMPL model.
  • Figure 3: Analysis of the CLIP score. While our model clearly generates images that follow input interaction descriptions more precisely than SANA [xie2024sana], the CLIP score indicates the opposite, rendering it unusable as a metric for our task.
  • Figure 4: Qualitative comparison for text-to-3D generation. InterFusion [dai2024interfusion] is based on Score Distillation Sampling; it is therefore slow and produces low-quality 3D due to the well-known Janus problem. TRELLIS [xiang2024trellis] is a learning-based native 3D generation method; it produces better 3D but is not interaction-aware. Our method faithfully follows the text prompts, especially the detailed body contact specifications. Our contacts are highlighted with spheres coloured according to the contacting body part.
  • Figure 5: Animation results. Our fitted SMPL model and segmented objects allow reanimation of the generated human-object interaction mesh.
  • ...and 7 more figures