SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
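
To make the token-length argument concrete, the following is a minimal sketch of why tokenizing only occupied voxels can shorten a sequence relative to a dense grid. It is not SIMART's tokenizer: the one-token-per-voxel assumption, the toy hollow-shell shape, and the helper names `occupancy_grid` and `token_counts` are all hypothetical, and the paper's reported 70% reduction is not reproduced here.

```python
def occupancy_grid(resolution, radius_frac=0.4, thickness=1.0):
    """Hypothetical toy shape: a hollow spherical shell on a cubic voxel grid.

    Returns the set of (x, y, z) voxel coordinates that lie on the surface.
    """
    center = (resolution - 1) / 2.0
    radius = radius_frac * resolution
    occupied = set()
    for x in range(resolution):
        for y in range(resolution):
            for z in range(resolution):
                dist = ((x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2) ** 0.5
                # Keep only voxels within `thickness` of the shell surface.
                if abs(dist - radius) <= thickness:
                    occupied.add((x, y, z))
    return occupied


def token_counts(resolution):
    """Compare sequence lengths under the assumption of one token per voxel."""
    dense = resolution ** 3                    # every grid cell becomes a token
    sparse = len(occupancy_grid(resolution))   # only occupied cells become tokens
    return dense, sparse


dense, sparse = token_counts(16)
print(f"dense tokens: {dense}, sparse tokens: {sparse}, ratio: {sparse / dense:.2f}")
```

Because surface geometry occupies only a thin shell of the grid, the sparse count grows roughly with surface area while the dense count grows with volume, which is the intuition behind the scalability claim above.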

Paper Structure (22 sections, 3 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: SIMART leverages the multimodal reasoning of MLLMs to unify URDF generation and semantic part grounding, transforming static 3D meshes into functional, simulation-ready articulated assets.
  • Figure 2: The SIMART pipeline. The framework first encodes 3D geometry into a compact representation using the Sparse 3D VQ-VAE to minimize token redundancy while preserving critical surface details. These geometric tokens are then fused with visual and textual inputs through a unified MLLM backbone to perform part grounding and joint parameter estimation. The final output consists of structured URDF metadata and decomposed segments, enabling deployment into physics-based simulators and interactive robotic environments.
  • Figure 3: Architectural overview of the Sparse 3D VQ-VAE for high-fidelity geometric encoding. The pipeline employs a 3D-UNet voxel encoder to map geometric inputs into a discrete latent space through vector quantization with a specialized codebook.
  • Figure 4: Qualitative comparison of articulated asset generation across different methods. Each object is visualized in two motion states to demonstrate kinematic accuracy and geometric fidelity. While existing generative baselines often produce simplified or misaligned meshes, SIMART achieves precise part-level segmentation and superior structural consistency, providing high-fidelity assets that closely match the ground-truth configurations.
  • Figure 5: Qualitative comparison of part grounding capabilities under descriptions for AI-generated objects. The results demonstrate that SIMART precisely identifies and isolates functional components such as lids and doors, maintaining superior geometric consistency with the ground truth.
  • ...and 3 more figures
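
The Figure 3 caption refers to vector quantization against a codebook. As a rough illustration of that generic step, and not of the paper's actual encoder, the sketch below maps each latent vector to the index of its nearest codebook entry by squared Euclidean distance. The function name `quantize`, the two-dimensional latents, and the three-entry codebook are assumptions chosen for readability.

```python
def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry.

    Distance is squared Euclidean; this is the standard VQ lookup, shown here
    on plain Python lists rather than on a learned encoder's output.
    """
    indices = []
    for z in latents:
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
        )
        indices.append(best)
    return indices


# Hypothetical toy codebook and latents (not values from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

tokens = quantize(latents, codebook)          # discrete token ids
reconstructed = [codebook[i] for i in tokens]  # decoder input after lookup
print(tokens, reconstructed)
```

The resulting integer indices are what a language-model backbone could consume as discrete geometric tokens; how SIMART trains its codebook and combines it with sparsity is described in the paper itself, not here.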