SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang

TL;DR

SPARK addresses the challenge of creating simulation-ready articulated 3D assets from a single image by fusing vision-language model priors with a diffusion-transformer generator and differentiable geometry optimization. It produces part-level meshes plus complete URDF parameters, guided by per-part references and a structural graph, and refines joint attributes via differentiable forward kinematics and rendering under open-state supervision. The approach introduces multi-level attention, hierarchical parent–child guidance, rectified-flow training, and texture generation to ensure geometric fidelity and kinematic consistency. Empirical results on PartNet-Mobility demonstrate improved shape reconstruction quality and URDF accuracy, with ablations confirming the contributions and practical applicability to robotic manipulation tasks such as drawer-opening in simulated environments.

Abstract

Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling. Project page: https://heyumeng.com/SPARK/index.html.
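
To make the "URDF parameters" mentioned in the abstract concrete, the following is a minimal sketch of how a joint type, axis, and origin could be serialized into a URDF file. It is not SPARK's code: the `Joint` dataclass, the helper names, and the placeholder `effort` and `velocity` values are assumptions made for illustration. Only the URDF element and attribute names (`robot`, `link`, `joint`, `parent`, `child`, `origin`, `axis`, `limit`) follow the standard URDF format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Joint:
    """Hypothetical container for the joint attributes named in the abstract."""
    name: str
    joint_type: str                      # "revolute", "prismatic", or "fixed"
    parent: str                          # parent link name
    child: str                           # child link name
    origin_xyz: tuple = (0.0, 0.0, 0.0)  # joint origin in the parent frame
    axis_xyz: tuple = (1.0, 0.0, 0.0)    # joint axis in the joint frame
    lower: float = 0.0                   # lower motion limit (rad or m)
    upper: float = 0.0                   # upper motion limit (rad or m)


def fmt(values):
    """Format a vector as the space-separated string URDF expects."""
    return " ".join(f"{v:g}" for v in values)


def joint_to_urdf(joint):
    """Build a URDF <joint> element from a Joint record."""
    el = ET.Element("joint", name=joint.name, type=joint.joint_type)
    ET.SubElement(el, "parent", link=joint.parent)
    ET.SubElement(el, "child", link=joint.child)
    ET.SubElement(el, "origin", xyz=fmt(joint.origin_xyz), rpy="0 0 0")
    if joint.joint_type != "fixed":
        ET.SubElement(el, "axis", xyz=fmt(joint.axis_xyz))
        # effort and velocity are placeholder values, not taken from the paper
        ET.SubElement(el, "limit", lower=f"{joint.lower:g}",
                      upper=f"{joint.upper:g}", effort="10", velocity="1")
    return el


def build_urdf(robot_name, links, joints):
    """Assemble a minimal URDF string (no meshes) from link names and joints."""
    robot = ET.Element("robot", name=robot_name)
    for link in links:
        ET.SubElement(robot, "link", name=link)
    for joint in joints:
        robot.append(joint_to_urdf(joint))
    return ET.tostring(robot, encoding="unicode")


if __name__ == "__main__":
    drawer = Joint("drawer_slide", "prismatic", "base", "drawer",
                   origin_xyz=(0.0, 0.0, 0.2), axis_xyz=(1.0, 0.0, 0.0),
                   lower=0.0, upper=0.3)
    print(build_urdf("cabinet", ["base", "drawer"], [drawer]))
```

In this sketch a coarse estimate and a refined estimate differ only in the numbers stored in `Joint`, so the same serializer would serve both stages; a real asset would also attach `visual` and `collision` meshes to each link.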

Paper Structure

This paper contains 51 sections, 38 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: SPARK is a novel framework that integrates VLM-guided part-level and global image guidance with diffusion transformers to produce high-quality articulated object reconstructions.
  • Figure 2: Pipeline Overview. We use a VLM to generate per-part reference images, predicted open-state images, and URDF templates with preliminary joint and link estimations. A Diffusion Transformer (DiT) equipped with local, global, and hierarchical attention mechanisms simultaneously synthesizes part-level and complete articulated meshes from a single image with VLM priors. We further employ a generative texture model to generate realistic textures and refine the URDF parameters using differentiable forward kinematics and differentiable rendering under the guidance of the predicted open-state images.
  • Figure 3: Qualitative Comparison on Shape Reconstruction. We compare our results with OmniPart [yang2025omnipart], PartCrafter [lin2025partcrafter], and URDFormer [chen2024urdformer]. Our method achieves accurate, high-fidelity shape reconstruction of articulated objects.
  • Figure 4: Qualitative Comparison on URDF Estimation. We compare our results with Articulate-Anything [le2024articulate] and Articulate-AnyMesh [qiu2025articulate]. The closed-state results are reconstructed or retrieved meshes, while the open-state configurations are obtained by applying kinematic transformations with the estimated URDF parameters. Our method achieves more accurate and physically consistent URDF estimation, leading to realistic articulation behavior.
  • Figure 5: In-the-wild image results. Additional examples of shape reconstruction and open-state prediction on in-the-wild images.
  • ...and 2 more figures
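
The Figure 4 caption notes that open-state configurations come from applying kinematic transformations with the estimated URDF parameters. As a rough illustration of that forward map only, the sketch below moves part vertices for a revolute joint (Rodrigues' rotation about an axis through the joint origin) or a prismatic joint (translation along the axis). The function names and argument layout are assumptions for this example, not SPARK's implementation. The paper's refinement differentiates through forward kinematics and rendering; here the map is written in plain Python without automatic differentiation, so it shows only which quantities (axis, origin, joint value) such an optimization would act on.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_about_axis(p, axis, origin, angle):
    """Revolute joint: rotate point p by `angle` (radians) about the line
    through `origin` with direction `axis`, using Rodrigues' formula."""
    k = normalize(axis)
    r = tuple(pi - oi for pi, oi in zip(p, origin))  # point relative to pivot
    c, s = math.cos(angle), math.sin(angle)
    kxr = cross(k, r)
    kdr = dot(k, r)
    rotated = tuple(r[i] * c + kxr[i] * s + k[i] * kdr * (1.0 - c)
                    for i in range(3))
    return tuple(rotated[i] + origin[i] for i in range(3))


def translate_along_axis(p, axis, distance):
    """Prismatic joint: slide point p by `distance` along the unit axis."""
    k = normalize(axis)
    return tuple(p[i] + distance * k[i] for i in range(3))


def articulate(points, joint_type, axis, origin, q):
    """Apply joint value q to every vertex of a child part."""
    if joint_type == "revolute":
        return [rotate_about_axis(p, axis, origin, q) for p in points]
    if joint_type == "prismatic":
        return [translate_along_axis(p, axis, q) for p in points]
    if joint_type == "fixed":
        return list(points)
    raise ValueError(f"unsupported joint type: {joint_type}")


if __name__ == "__main__":
    door = [(2.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
    # Swing a door edge a quarter turn about a vertical hinge at x = 1.
    print(articulate(door, "revolute", (0, 0, 1), (1.0, 0.0, 0.0), math.pi / 2))
```

Rewriting the same arithmetic with an array library that supports automatic differentiation would expose gradients with respect to `axis`, `origin`, and `q`, which is the general idea behind refining joint parameters against an open-state image.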