URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

Zhe Li, Xiang Bai, Jieyu Zhang, Zhuangzhe Wu, Che Xu, Ying Li, Chengkai Hou, Shanghang Zhang

TL;DR

URDF-Anything introduces an end-to-end framework that reconstructs functional URDF digital twins of articulated objects directly from visual observations by leveraging a 3D Multimodal Large Language Model equipped with a dynamic [SEG] token. The method jointly infers part-level segmentation and kinematic parameters, ensuring geometric and symbolic consistency, and demonstrates superior segmentation accuracy, joint parameter prediction, and physical executability on PartNet-Mobility with strong generalization to unseen objects. A key contribution is the [SEG] mechanism, which tightly couples segmentation with autoregressive URDF prediction via cross-attention, enabling robust end-to-end reconstruction from 3D point clouds. The work advances robotic simulation and sim-to-real transfer by providing an efficient, end-to-end pathway from raw observations to executable URDFs, while acknowledging limitations such as missing physical properties and reliance on a mesh-conversion step.
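To make the coupling between [SEG] tokens and point features concrete, here is a minimal sketch of one way a set of [SEG] embeddings could be turned into per-point part labels. It is an illustration under stated assumptions, not the paper's decoder: the function name `assign_points_to_parts`, the feature dimension, and the single-head scaled dot-product scoring are all hypothetical simplifications of the cross-attention described above.

```python
import numpy as np

def assign_points_to_parts(point_feats, seg_embeds):
    """Score every point against every [SEG] embedding and pick the best part.

    point_feats: (N, d) per-point features (assumed to come from a point encoder).
    seg_embeds:  (K, d) one embedding per generated [SEG] token (assumed).
    Returns integer labels of shape (N,) and soft assignments of shape (N, K).
    """
    d = point_feats.shape[1]
    # Scaled dot-product scores: points act as queries, [SEG] embeddings as keys.
    logits = point_feats @ seg_embeds.T / np.sqrt(d)
    # Numerically stable softmax over the K parts.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Toy data: 3 parts, 5 points each, features clustered around each embedding.
rng = np.random.default_rng(0)
seg_embeds = np.eye(3, 8) * 4.0
point_feats = np.repeat(seg_embeds, 5, axis=0) + 0.1 * rng.standard_normal((15, 8))
labels, probs = assign_points_to_parts(point_feats, seg_embeds)
```

Because each [SEG] embedding is produced while the URDF text is being generated, a scheme of this kind ties every predicted link in the symbolic output to exactly one mask over the point cloud, which is the consistency property the summary emphasizes.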

Abstract

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and for building embodied AI world models, yet it has historically required painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything uses an autoregressive prediction framework that takes point-cloud and text input and jointly optimizes geometric segmentation and kinematic parameter prediction. It implements a specialized $[SEG]$ token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (a 17% mIoU improvement), kinematic parameter prediction (an average error reduction of 29%), and physical executability (surpassing baselines by 50%). Notably, our method generalizes well, performing strongly even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing sim-to-real transfer capability.
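For readers unfamiliar with the segmentation metric quoted above, the following is a small sketch of the standard mean intersection-over-union (mIoU) computation on per-point part labels. The exact evaluation protocol of the paper is not given in this section, so the function name `mean_iou` and the choice to skip parts absent from both prediction and ground truth are assumptions.

```python
import numpy as np

def mean_iou(pred, gt, num_parts):
    """Average IoU over parts, computed on per-point integer labels."""
    ious = []
    for k in range(num_parts):
        pred_k, gt_k = pred == k, gt == k
        union = np.logical_or(pred_k, gt_k).sum()
        if union == 0:
            continue  # part k appears in neither labeling; skip it (assumption)
        ious.append(np.logical_and(pred_k, gt_k).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Six points, two parts; one point of part 0 is mislabeled as part 1.
gt = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])
score = mean_iou(pred, gt, num_parts=2)  # (2/3 + 3/4) / 2 = 17/24
```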

Paper Structure

This paper contains 29 sections, 5 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: URDF-Anything: Generating Functional URDF Digital Twins from Visual Observations (single or multi-view images). Our framework, utilizing a 3D Multimodal Large Language Model and guided by instructions (e.g., "Segment parts and predict parameters"), processes the point cloud to jointly infer geometric part segmentation and kinematic structure. The output is a segmented 3D model with defined joints (represented here by different part colors), forming a functional URDF digital twin directly usable in physics simulators.
  • Figure 2: Overview of the URDF-Anything Framework. The pipeline takes a 3D point cloud (derived from images) and a structured language instruction as input. The 3D MLLM (fine-tuned with LoRA) autoregressively generates symbolic output (kinematic parameters) and $[SEG]$ tokens. The embeddings corresponding to the generated $[SEG]$ tokens then interact with the point cloud features via a 3D Decoder to perform fine-grained geometric segmentation of the point cloud into individual links. Finally, the jointly predicted kinematic parameters and the segmented geometry are integrated into a functional URDF file, resulting in a complete articulated 3D model ready for physics simulation.
  • Figure 3: Qualitative Comparison of Articulated Object Reconstruction Results. The top row displays the input image for various articulated object instances (each column represents a different object). Baseline methods frequently predict incorrect object types, generate distorted geometry, or exhibit significant errors in link placement, leading to misaligned or incorrect structures.
  • Figure 4: SAPIENS Simulator Rendering Strategies
  • Figure 5: LGM: Point Cloud Generation via Multi-view Synthesis
  • ...and 4 more figures
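
The final step described in the Figure 2 caption, integrating predicted kinematic parameters and segmented geometry into a URDF file, can be sketched with Python's standard library. This is a hedged illustration only: the function `build_urdf`, the dictionary layout of `links` and `joints`, the mesh file names, and the placeholder `effort` and `velocity` values are all assumptions, not the authors' implementation. The XML elements themselves (`robot`, `link`, `joint`, `parent`, `child`, `origin`, `axis`, `limit`) follow the standard URDF format.

```python
import xml.etree.ElementTree as ET

def fmt(vec):
    """Format a 3-vector as a space-separated URDF attribute string."""
    return " ".join(str(v) for v in vec)

def build_urdf(name, links, joints):
    """Assemble a URDF string from per-link meshes and predicted joint parameters."""
    robot = ET.Element("robot", name=name)
    for link in links:
        link_el = ET.SubElement(robot, "link", name=link["name"])
        visual = ET.SubElement(link_el, "visual")
        geometry = ET.SubElement(visual, "geometry")
        # One mesh per segmented part (hypothetical file names).
        ET.SubElement(geometry, "mesh", filename=link["mesh"])
    for j in joints:
        joint_el = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint_el, "parent", link=j["parent"])
        ET.SubElement(joint_el, "child", link=j["child"])
        ET.SubElement(joint_el, "origin", xyz=fmt(j["origin"]), rpy="0 0 0")
        ET.SubElement(joint_el, "axis", xyz=fmt(j["axis"]))
        if j["type"] in ("revolute", "prismatic"):
            lower, upper = j["limit"]
            # effort and velocity are placeholders, not predicted quantities.
            ET.SubElement(joint_el, "limit", lower=str(lower), upper=str(upper),
                          effort="1.0", velocity="1.0")
    return ET.tostring(robot, encoding="unicode")

# Toy example: a cabinet body with one hinged door.
links = [
    {"name": "base", "mesh": "base.obj"},
    {"name": "door", "mesh": "door.obj"},
]
joints = [
    {"name": "door_hinge", "type": "revolute", "parent": "base", "child": "door",
     "origin": (0.3, 0.0, 0.0), "axis": (0, 0, 1), "limit": (0.0, 1.57)},
]
urdf_text = build_urdf("cabinet", links, joints)
```

The sketch emits only visual geometry and kinematics; inertial and collision elements are omitted, in line with the limitation noted in the TL;DR that physical properties are not predicted.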