DreamArt: Generating Interactable Articulated Objects from a Single Image

Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang, Diwen Wan, Gang Zeng, Yixin Chen, Siyuan Huang

TL;DR

DreamArt tackles the challenge of generating interactable articulated 3D assets from a single image. It introduces a three-stage pipeline: part-aware 3D object generation with mask-guided segmentation and amodal completion, articulation video synthesis using movable-part masks and amodal cues, and joint estimation with a differentiable texture refinement to realize plausible motion. The approach demonstrates state-of-the-art performance in articulation video synthesis and video-conditioned asset generation, with strong generalization to in-the-wild images. This work enables scalable production of high-fidelity, manipulable assets for embodied AI, AR/VR, and robotics.

Abstract

Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline. First, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, it fine-tunes a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.
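
The abstract states that articulation motion is represented by a dual quaternion. As a rough illustration of what that representation encodes, the following minimal sketch builds a unit dual quaternion for a rigid motion and applies it to a point. It is not the authors' implementation: all function names (`dual_quat`, `revolute`, `transform`) are hypothetical, it uses only the standard dual-quaternion convention (real part = rotation, dual part = half the translation times the rotation), and it performs no optimization against video.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * v * conj(q)."""
    _, x, y, z = qmul(qmul(q, (0.0, *v)), qconj(q))
    return (x, y, z)

def axis_angle(axis, angle):
    """Unit rotation quaternion for a rotation of `angle` about `axis`."""
    n = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / n
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def dual_quat(axis, angle, translation):
    """Dual quaternion (real, dual) for 'rotate, then translate'."""
    real = axis_angle(axis, angle)
    dual = tuple(0.5 * c for c in qmul((0.0, *translation), real))
    return real, dual

def revolute(axis, pivot, angle):
    """Hinge-like joint (assumed model): rotate about `axis` through `pivot`.

    x -> R(x - p) + p, which equals rotation R plus translation p - R p.
    """
    rp = rotate(axis_angle(axis, angle), pivot)
    t = tuple(p - r for p, r in zip(pivot, rp))
    return dual_quat(axis, angle, t)

def transform(dq, point):
    """Apply a unit dual quaternion to a 3D point."""
    real, dual = dq
    # The translation is recovered as 2 * dual * conj(real).
    _, tx, ty, tz = (2.0 * c for c in qmul(dual, qconj(real)))
    rx, ry, rz = rotate(real, point)
    return (rx + tx, ry + ty, rz + tz)

# A door-like part hinged at (1, 0, 0) about the z-axis, opened by 90 degrees.
hinge = revolute((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), math.pi / 2)
moved = transform(hinge, (2.0, 0.0, 0.0))

# A drawer-like part sliding 0.5 units along z (pure translation).
slide = dual_quat((0.0, 0.0, 1.0), 0.0, (0.0, 0.0, 0.5))
slid = transform(slide, (1.0, 2.0, 3.0))
```

One reason such a representation is convenient is that a single eight-number object covers both hinge-like and sliding motion, so one parameterization can describe either joint type.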

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Method Overview. Our three-stage pipeline first reconstructs complete, segmented part meshes from a single image. Next, it synthesizes plausible articulation videos using amodal images and part masks as prompts. Finally, it optimizes joint parameters and refines texture maps for enhanced realism.
  • Figure 2: Qualitative comparison of synthesized articulation videos. We present qualitative results on both in-domain and in-the-wild data. Our method consistently outperforms the baselines by producing clearer and more plausible articulation, particularly in multi-part object scenarios.
  • Figure 3: Visualizations on asset synthesis. DreamArt shows clearer images with more plausible articulations than baselines, especially under novel views.
  • Figure 4: Ablation on amodal images. The inclusion of amodal images leads to more plausible articulation generation.
  • Figure 5: PartRM performs well on in-domain data but generalizes poorly to in-the-wild data.
  • ...and 3 more figures