Interact3D: Compositional 3D Generation of Interactive Objects

Hui Shan, Keyang Luo, Ming Li, Sizhe Zheng, Yanwei Fu, Zhen Chen, Xiangru Huang

Abstract

Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images, particularly under occlusions, remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present Interact3D, a novel framework designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D produces collision-aware compositions with improved geometric fidelity and consistent spatial relationships.
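
To make the idea of an SDF-based intersection penalty concrete, the following is a minimal sketch and not the authors' implementation. It assumes the anchor object is an analytic sphere (so its SDF and gradient are known in closed form), that the remaining object is a point set, and that only a translation is optimized with plain gradient descent. The function names, the squared-depth loss, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere (negative inside) and its unit gradient."""
    offsets = points - center
    dist = np.linalg.norm(offsets, axis=1)
    normals = offsets / np.maximum(dist, 1e-9)[:, None]
    return dist - radius, normals

def penetration_loss(points, translation, center, radius):
    """Mean squared penetration depth of the translated points, with its
    analytic gradient with respect to the translation."""
    sdf, normals = sphere_sdf(points + translation, center, radius)
    depth = np.maximum(0.0, -sdf)  # nonzero only for points inside the anchor
    loss = np.mean(depth ** 2)
    grad = np.mean(-2.0 * depth[:, None] * normals, axis=0)
    return loss, grad

def resolve_collision(points, center, radius, lr=0.5, steps=500):
    """Gradient descent on the translation (hypothetical hyperparameters)."""
    translation = np.zeros(3)
    for _ in range(steps):
        loss, grad = penetration_loss(points, translation, center, radius)
        if loss == 0.0:
            break
        translation -= lr * grad
    return translation

# Toy setup: a 3x3x3 grid of points partly inside a unit sphere at the origin.
axis_x = np.array([0.8, 0.9, 1.0])
axis_yz = np.array([-0.1, 0.0, 0.1])
grid = np.array([[x, y, z] for x in axis_x for y in axis_yz for z in axis_yz])
sphere_center = np.zeros(3)
sphere_radius = 1.0

initial_loss, _ = penetration_loss(grid, np.zeros(3), sphere_center, sphere_radius)
shift = resolve_collision(grid, sphere_center, sphere_radius)
final_loss, _ = penetration_loss(grid, shift, sphere_center, sphere_radius)
```

In this toy setup the penalty only activates for points with negative signed distance, so the optimizer moves the point set outward until the overlap vanishes. A full pipeline would presumably also optimize rotation and scale and include a term that keeps the object near its pose in the guidance scene; those parts are not sketched here.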

Paper Structure (35 sections, 5 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Given a text prompt and a user-given mesh, Interact3D synthesizes a high-quality, geometrically-compatible complementary mesh. It then seamlessly composes these two assets into an interactive 3D scene. Unlike existing approaches, our framework automatically generates collision-aware and physically-sound 3D environments.
  • Figure 2: Given a generated scene, PartField directly segments it to extract individual 3D assets, whereas Interact3D leverages it as 3D spatial guidance to compose high-quality independent geometries. (Though generating scenes and meshes via separate TRELLIS2 inferences yields minor texture variations, their spatial guidance remains robust to these discrepancies.)
  • Figure 3: Overview of Interact3D. Given a user-provided mesh $\mathbf{M}$, we render it and use Nano Banana Pro to synthesize a guidance scene image $I_\mathrm{scene}$ and a complementary image $I_\mathrm{comp}$. TRELLIS2 then reconstructs these into 3D meshes ($\mathbf{M_\mathrm{scene}}$ and $\mathbf{M_\mathrm{comp}}$) to provide spatial guidance. Relying on coarse parts segmented by PartField, we execute a two-stage composition. Stage 1 performs a global-to-local registration on the "largest" mesh to establish the anchor pose, while Stage 2 applies an SDF-based collision-aware optimization on the remaining mesh to resolve spatial intersections. (In this case, $\mathbf{M_\mathrm{comp}}$ serves as $\mathbf{M_\mathrm{anchor}}$ and $\mathbf{M}$ serves as $\mathbf{M_\mathrm{remain}}$.) Finally, a VLM-based agentic refinement handles unavoidable collisions, yielding a physically sound interactive 3D scene.
  • Figure 4: Agentic Refinement. When generating flowers inside a vase, severe occlusion in image $I_\mathrm{scene}$ causes the loss of object-object spatial relationships during TRELLIS2 reconstruction. As a result, the flower stem in the complementary image $I_\mathrm{comp}$ (generated by Nano Banana Pro) may not align correctly with the vase mesh $\mathbf{M}$. This leads to geometric intersections after composition (see top-right image). To resolve this, we render multi-view images, including internal cross-sections. These renderings are analyzed by a VLM, which generates a corrective text prompt to update the complementary image via Nano Banana Pro. This process continues until no more geometric intersections are found or the maximum iteration limit is reached.
  • Figure 5: Composition results with more than two parts. Given a cabinet mesh, we render an image of it, add three pairs of shoes to the image using Nano Banana Pro, and reconstruct the result with TRELLIS2 to obtain the object-object spatial relationships (OOR). Next, we use Nano Banana Pro again to extract each individual object in the shoe cabinet and reconstruct each one with TRELLIS2. Finally, using the OOR information, we sequentially add them to the cabinet mesh to form an interactive 3D scene.
  • ...and 10 more figures
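
The closed-loop refinement described in the Figure 4 caption can be sketched as a simple control loop. This is an illustrative outline under stated assumptions, not the authors' code: the four callables (`compose`, `render_views`, `critique`, `edit_image`) are hypothetical stand-ins for the composition pipeline, the multi-view renderer, the VLM, and the image editing module, and the default `max_iters` value is a placeholder since the caption does not state the actual limit.

```python
from typing import Any, Callable, List, Optional, Tuple

def agentic_refinement(
    comp_image: Any,
    compose: Callable[[Any], Any],                 # image -> composed scene
    render_views: Callable[[Any], List[Any]],      # scene -> multi-view renderings
    critique: Callable[[List[Any]], Optional[str]],  # renderings -> prompt, or None if clean
    edit_image: Callable[[Any, str], Any],         # (image, prompt) -> edited image
    max_iters: int = 3,                            # hypothetical default
) -> Tuple[Any, int]:
    """Re-compose until the critic reports no intersections or the
    iteration limit is reached. Returns the scene and the edit count."""
    scene = compose(comp_image)
    for edits_applied in range(max_iters):
        prompt = critique(render_views(scene))
        if prompt is None:
            # No intersections reported: stop early.
            return scene, edits_applied
        # Apply the corrective prompt and rebuild the scene.
        comp_image = edit_image(comp_image, prompt)
        scene = compose(comp_image)
    return scene, max_iters

# Toy stand-ins: the "image" is a stem width; widths above 1 collide.
def toy_compose(width):
    return {"stem_width": width}

def toy_render(scene):
    return [scene["stem_width"]]

def toy_critique(views):
    return "make the stem thinner" if views[0] > 1 else None

def toy_edit(width, prompt):
    return width - 1

resolved_scene, resolved_edits = agentic_refinement(
    3, toy_compose, toy_render, toy_critique, toy_edit)
capped_scene, capped_edits = agentic_refinement(
    10, toy_compose, toy_render, toy_critique, toy_edit)
```

Passing the modules in as callables keeps the loop independent of any particular VLM or image editor, which is a design choice of this sketch rather than something the caption specifies.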