URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li; Xiang Bai; Jieyu Zhang; Zhuangzhe Wu; Che Xu; Ying Li; Chengkai Hou; Shanghang Zhang

TL;DR

URDF-Anything introduces an end-to-end framework that reconstructs functional URDF digital twins of articulated objects directly from visual observations by leveraging a 3D Multimodal Large Language Model equipped with a dynamic [SEG] token. The method jointly infers part-level segmentation and kinematic parameters, ensuring geometric and symbolic consistency, and demonstrates superior segmentation accuracy, joint parameter prediction, and physical executability on PartNet-Mobility with strong generalization to unseen objects. A key contribution is the [SEG] mechanism, which tightly couples segmentation with autoregressive URDF prediction via cross-attention, enabling robust end-to-end reconstruction from 3D point clouds. The work advances robotic simulation and sim-to-real transfer by providing an efficient, end-to-end pathway from raw observations to executable URDFs, while acknowledging limitations such as missing physical properties and reliance on a mesh-conversion step.
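To make the [SEG] idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the hidden state emitted at a [SEG] token can act as a query scored against per-point features, with a scaled dot product and a sigmoid yielding a per-point membership probability for one part. The function name `seg_token_mask`, the feature dimension, the threshold, and the toy values are all hypothetical; the actual model uses learned cross-attention over point cloud features.

```python
import math

def seg_token_mask(seg_embedding, point_features, threshold=0.5):
    """Score each point feature against one [SEG] embedding (hypothetical sketch).

    Returns per-point probabilities and a boolean part mask.
    """
    scale = math.sqrt(len(seg_embedding))  # scaled dot product, as in attention
    probs = []
    for feat in point_features:
        logit = sum(q * k for q, k in zip(seg_embedding, feat)) / scale
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    mask = [p > threshold for p in probs]
    return probs, mask

# Toy example: a 2-D [SEG] embedding and three point features (assumed values).
seg = [1.0, 0.0]
points = [[4.0, 0.0], [-4.0, 0.0], [3.0, 1.0]]
probs, mask = seg_token_mask(seg, points)
print(mask)
```

In this toy setup, points whose features align with the [SEG] embedding are assigned to the part, which is the sense in which segmentation is tied to the token the language model produces while writing the URDF.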

Abstract

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and for building embodied AI world models, yet it has historically required painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything uses an autoregressive prediction framework that takes point-cloud and text input and jointly optimizes geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (17% mIoU improvement), kinematic parameter prediction (29% average error reduction), and physical executability (surpassing baselines by 50%). Notably, our method generalizes well, performing strongly even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing sim-to-real transfer capability.
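For readers unfamiliar with the target format, the sketch below shows what a small URDF with two links and one joint looks like when assembled from kinematic parameters (joint type, origin, axis, limits). It is an illustration only: the helper `build_urdf`, the link names, and all numeric values are hypothetical and are not taken from the paper. Geometry (the segmented part meshes) is omitted.

```python
import xml.etree.ElementTree as ET

def build_urdf(name, parent, child, joint_type, origin_xyz, axis_xyz, lower, upper):
    """Assemble a two-link, one-joint URDF string (hypothetical helper)."""
    robot = ET.Element("robot", name=name)
    ET.SubElement(robot, "link", name=parent)
    ET.SubElement(robot, "link", name=child)

    joint = ET.SubElement(robot, "joint", name=f"{parent}_to_{child}", type=joint_type)
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    # Joint frame position relative to the parent link; rotation left at zero.
    ET.SubElement(joint, "origin", xyz=" ".join(map(str, origin_xyz)), rpy="0 0 0")
    # Axis of rotation (revolute) or translation (prismatic).
    ET.SubElement(joint, "axis", xyz=" ".join(map(str, axis_xyz)))
    # Effort and velocity are placeholder values, not predicted quantities.
    ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                  effort="1.0", velocity="1.0")
    return ET.tostring(robot, encoding="unicode")

# Assumed example: a cabinet door hinged about the vertical axis.
urdf_text = build_urdf("cabinet", "body", "door", "revolute",
                       origin_xyz=(0.3, 0.2, 0), axis_xyz=(0, 0, 1),
                       lower=0.0, upper=1.57)
print(urdf_text)
```

A file like this can be loaded by a simulator once each link is given visual and collision geometry. Physical properties such as mass and inertia are not filled in here.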
