ArtiWorld: LLM-Driven Articulation of 3D Objects in Scenes

Yixuan Yang, Luyang Xie, Zhen Luo, Zixiang Zhao, Tongsheng Ding, Mingqi Gao, Feng Zheng

TL;DR

ArtiWorld introduces a scene-aware pipeline that automatically identifies articulable objects in 3D scenes and converts rigid assets into executable URDF-based articulated objects while preserving the original geometry. The core Arti4URDF model embeds 3D point-cloud geometry into a large language model to infer inter-part relations, joint types, and kinematic parameters, generating both a JSON-style kinematic tree and a complete URDF. Trained on PartNet-Mobility and PhysXNet, and evaluated on individual objects, full scenes, and real-world scans, ArtiWorld achieves state-of-the-art joint-type prediction and axis localization, with strong generalization to unseen categories and real-world data. The approach enables interactive, robot-ready simulation environments directly from existing 3D assets, facilitating scalable robot learning and data augmentation.

Abstract

Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point clouds, the prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scanned scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the ArtiWorld pipeline. Given a room-scale or tabletop-scale scene description, our system uses a language model to identify objects that should be articulated and retrieves their corresponding 3D assets. Point clouds are sampled from each object's surface and encoded as geometric tokens. These tokens, together with a structured text prompt, are fed into Arti4URDF, which predicts joint types, axes, and articulation limits. The generated URDF models are then aligned back to their original scene positions, producing fully interactive articulated scenes suitable for simulation and downstream robotic tasks.
  • Figure 2: Overview of the Arti4URDF pipeline. Arti4URDF takes raw 3D objects and samples their surfaces to obtain point clouds for training and inference. A unified point cloud encoder extracts both global and local geometric features, which are mapped into the LLM embedding space through lightweight adapters. These features are injected into structured URDF prompts that describe part–joint relationships and articulation rules. The LLM-based Arti4URDF model then generates JSON-style structural descriptions and full URDF files, which can be used to produce articulated mesh or point cloud outputs that are ready for simulation and downstream robotic interaction.
  • Figure 3: Qualitative comparison of articulated object reconstruction results.
  • Figure 4: Qualitative results in simulated scenes and real-world scanned scenes.
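
The last stage described in the Figure 2 caption, turning a JSON-style structural description into a URDF file, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the JSON field names (`links`, `joints`, `origin`, `axis`, `limit`), the example cabinet, the `meshes/` directory, and the placeholder `effort` and `velocity` values are all assumptions made for illustration. Only the URDF element names (`robot`, `link`, `joint`, `parent`, `child`, `origin`, `axis`, `limit`) follow the URDF format itself.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical JSON-style kinematic tree. The schema is an assumption made
# for this sketch, not the format actually produced by Arti4URDF.
TREE_JSON = """
{
  "name": "cabinet",
  "links": ["base", "door", "drawer"],
  "joints": [
    {"name": "door_hinge", "type": "revolute",
     "parent": "base", "child": "door",
     "origin": [0.3, -0.2, 0.0], "axis": [0, 0, 1], "limit": [0.0, 1.57]},
    {"name": "drawer_slide", "type": "prismatic",
     "parent": "base", "child": "drawer",
     "origin": [0.0, 0.0, 0.4], "axis": [1, 0, 0], "limit": [0.0, 0.25]}
  ]
}
"""


def fmt(values):
    """Format a numeric sequence as a space-separated URDF attribute."""
    return " ".join(f"{v:g}" for v in values)


def tree_to_urdf(tree, mesh_dir="meshes"):
    """Build a URDF string from the (assumed) kinematic-tree dictionary."""
    robot = ET.Element("robot", name=tree["name"])

    # One link per part; each link points at its own mesh file so the
    # original part geometry is referenced rather than regenerated.
    for link_name in tree["links"]:
        link = ET.SubElement(robot, "link", name=link_name)
        visual = ET.SubElement(link, "visual")
        geometry = ET.SubElement(visual, "geometry")
        ET.SubElement(geometry, "mesh", filename=f"{mesh_dir}/{link_name}.obj")

    # One joint per parent-child relation.
    for j in tree["joints"]:
        joint = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint, "parent", link=j["parent"])
        ET.SubElement(joint, "child", link=j["child"])
        ET.SubElement(joint, "origin", xyz=fmt(j["origin"]), rpy="0 0 0")
        if j["type"] != "fixed":
            ET.SubElement(joint, "axis", xyz=fmt(j["axis"]))
        if j["type"] in ("revolute", "prismatic"):
            lower, upper = j["limit"]
            # effort and velocity are placeholder values for this sketch.
            ET.SubElement(joint, "limit", lower=f"{lower:g}",
                          upper=f"{upper:g}", effort="10", velocity="1")

    return ET.tostring(robot, encoding="unicode")


urdf_text = tree_to_urdf(json.loads(TREE_JSON))
print(urdf_text)
```

Referencing one mesh file per link is one simple way to keep the input geometry untouched, in the spirit of the geometry-preservation goal stated in the abstract; how ArtiWorld actually segments and stores part meshes is not specified in this section.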