ArtLLM: Generating Articulated Assets via 3D LLM

Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu

TL;DR

ArtLLM is a novel framework that generates high-quality articulated assets directly from complete 3D meshes. It significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, and generalizes robustly to real-world objects.

Abstract

Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
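
As a rough illustration of what an autoregressively predicted layout with a variable number of parts and joints could look like as a token sequence, here is a minimal Python sketch. It is not the paper's tokenizer: the `Part` and `Joint` classes, the `<part>`, `<revolute>`, `<prismatic>`, and `<eos>` markers, the 256-bin quantization, and the assumption that coordinates are normalized to [-1, 1] are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Part:
    # Hypothetical part layout: an axis-aligned box in normalized coordinates.
    bbox_min: Vec3
    bbox_max: Vec3


@dataclass
class Joint:
    # Hypothetical joint record; "kind" is "revolute" or "prismatic".
    kind: str
    parent: int
    child: int
    origin: Vec3
    axis: Vec3


def quantize(x: float, lo: float = -1.0, hi: float = 1.0, bins: int = 256) -> int:
    """Map a value in [lo, hi] to an integer bin index in [0, bins - 1]."""
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * bins), bins - 1)


def serialize(parts: List[Part], joints: List[Joint], bins: int = 256) -> List[str]:
    """Flatten parts and joints into one token list (assumed format, not the paper's)."""
    tokens: List[str] = []
    for p in parts:
        tokens.append("<part>")
        tokens += [str(quantize(v, bins=bins)) for v in (*p.bbox_min, *p.bbox_max)]
    for j in joints:
        tokens.append(f"<{j.kind}>")
        tokens += [str(j.parent), str(j.child)]
        tokens += [str(quantize(v, bins=bins)) for v in (*j.origin, *j.axis)]
    tokens.append("<eos>")
    return tokens


# Example: a cabinet body (part 0) with one hinged door (part 1).
body = Part((-0.5, -0.5, -0.5), (0.5, 0.5, 0.5))
door = Part((-0.5, 0.5, -0.5), (0.5, 0.55, 0.5))
hinge = Joint("revolute", parent=0, child=1, origin=(-0.5, 0.5, 0.0), axis=(0.0, 0.0, 1.0))
tokens = serialize([body, door], [hinge])
```

Because the sequence ends with an explicit end marker, the same format covers objects with any number of parts and joints, which is the property the abstract attributes to autoregressive prediction.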

Paper Structure (29 sections, 10 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: We propose ArtLLM, a novel framework capable of rapidly generating articulated assets from images or text. By using a 3D LLM to jointly predict part layouts and joints, and integrating state-of-the-art part generation methods, our approach can produce high-quality, physically grounded articulated assets.
  • Figure 2: Architecture Overview. Given an input point cloud, ArtLLM first predicts a tokenized articulation blueprint that specifies part layouts and kinematic structures. This blueprint then conditions a part-aware generative model to synthesize high-fidelity link geometries. Finally, a physics-based joint-limit correction module refines the articulation, producing simulation-ready articulated assets.
  • Figure 3: Physical limit calculation. Illustration of our physics-based limit correction process.
  • Figure 4: Qualitative Comparison. Baseline methods rely on retrieving parts from a fixed asset library, hence often fail to recover accurate geometry and frequently generate incorrect articulations with mismatched joint types or misaligned joint positions. In contrast, our approach produces geometry that closely matches the input and recovers correct, coherent articulations.
  • Figure 5: Qualitative result for physical limit correction. Before correction, the predicted joint ranges cause noticeable self-collisions during articulation. After applying our physics-based limit refinement, the articulated parts move smoothly without collision, yielding physically plausible and stable motion.
  • ...and 4 more figures
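
The idea behind collision-based joint-limit correction (Figures 3 and 5) can be sketched in a few lines of Python. This is a simplified 2D illustration under stated assumptions, not the paper's procedure: the function names, the 1-degree sweep step, and the use of axis-aligned bounding boxes as a conservative collision test are all hypothetical. The sketch sweeps a revolute part from its rest pose toward the predicted upper limit and keeps the last collision-free angle.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def rotate_points(points: List[Point], pivot: Point, angle: float) -> List[Point]:
    """Rotate 2D points counter-clockwise about a pivot by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot
    return [(px + c * (x - px) - s * (y - py), py + s * (x - px) + c * (y - py))
            for x, y in points]


def aabb(points: List[Point]) -> Box:
    """Axis-aligned bounding box of a point set."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def corrected_upper_limit(part: List[Point], pivot: Point, obstacle: List[Point],
                          predicted_limit: float,
                          step: float = math.radians(1.0)) -> float:
    """Sweep from 0 toward predicted_limit; return the last collision-free angle."""
    obstacle_box = aabb(obstacle)
    safe = 0.0
    for i in range(1, int(predicted_limit / step) + 1):
        angle = i * step
        if overlaps(aabb(rotate_points(part, pivot, angle)), obstacle_box):
            return safe  # first colliding angle found; clamp to the previous one
        safe = angle
    return predicted_limit  # no collision anywhere in the predicted range


# Example: a thin door hinged at the origin, with a shelf in its swing path.
door = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.05), (0.0, 0.05)]
shelf = [(-0.5, 0.8), (0.5, 0.8), (0.5, 1.2), (-0.5, 1.2)]
limit = corrected_upper_limit(door, (0.0, 0.0), shelf, math.radians(120.0))
```

Bounding boxes overestimate the swept part, so this check may clamp the limit earlier than an exact mesh-level collision test would. A simulation-ready pipeline would more plausibly rely on a physics engine or a mesh collision library.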