Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu

TL;DR

Kinematify tackles open-vocabulary synthesis of high-DoF articulated objects from static inputs such as RGB images or text. It presents a three-stage pipeline: part-aware 3D reconstruction that produces a segmented mesh, kinematic topology inference via Monte Carlo Tree Search (MCTS) guided by a multi-term reward, and DW-CAVL optimization that estimates joint parameters from static geometry before exporting a URDF. The method reports improved kinematic-tree fidelity and joint-parameter accuracy over prior work across everyday objects and robotic platforms, and it demonstrates practical viability through end-to-end pipelines and real-world robot manipulation tasks. By enabling zero-shot articulation synthesis from open inputs, Kinematify advances scalable, physics-aware modeling of complex articulated systems for manipulation, simulation, and planning.
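
To make the final export step concrete, the following is a minimal sketch of writing a kinematic tree to URDF with the Python standard library. It is not the authors' exporter: the function name `build_urdf`, the dictionary layout for joints, and the cabinet example with its numeric values are all assumptions chosen for illustration. Only the URDF element names (`robot`, `link`, `joint`, `parent`, `child`, `origin`, `axis`, `limit`) come from the URDF format itself.

```python
import xml.etree.ElementTree as ET


def _fmt(vec):
    """Format a 3-vector as the space-separated string URDF expects."""
    return " ".join(f"{v:g}" for v in vec)


def build_urdf(name, links, joints):
    """Return a minimal URDF document as a string.

    links  : list of link names (geometry is omitted in this sketch).
    joints : list of dicts with the hypothetical keys
             name, type, parent, child, origin, axis, limits.
    """
    robot = ET.Element("robot", name=name)
    for link in links:
        ET.SubElement(robot, "link", name=link)
    for j in joints:
        joint = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint, "parent", link=j["parent"])
        ET.SubElement(joint, "child", link=j["child"])
        # Joint frame expressed in the parent link frame.
        ET.SubElement(joint, "origin", xyz=_fmt(j["origin"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz=_fmt(j["axis"]))
        # URDF requires a <limit> element for revolute and prismatic joints.
        if j["type"] in ("revolute", "prismatic"):
            lower, upper = j["limits"]
            ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                          effort="10", velocity="1")  # placeholder values
    return ET.tostring(robot, encoding="unicode")


# Illustrative input: a cabinet body with one hinged door (made-up numbers).
cabinet_urdf = build_urdf(
    "cabinet",
    ["body", "door"],
    [{
        "name": "door_hinge", "type": "revolute",
        "parent": "body", "child": "door",
        "origin": (0.3, 0.2, 0.0), "axis": (0.0, 0.0, 1.0),
        "limits": (0.0, 1.57),
    }],
)
```

Calling `print(cabinet_urdf)` shows the resulting XML. A real export would also attach visual and collision meshes to each link, which this sketch leaves out.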

Abstract

A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated object models, which underpin tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine Monte Carlo Tree Search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.

Paper Structure

This paper contains 30 sections, 15 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of Kinematify. A part-aware 3D foundation model first reconstructs a segmented digital twin. Then, the kinematic tree is recovered via Monte Carlo Tree Search (MCTS) driven by rewards for structure, stability, contact, symmetry, and hierarchy. Finally, joint types are predicted by a vision language model (VLM), and joint parameters are optimized on the parent link’s signed distance field (SDF) to enforce contact consistency and avoid collisions.
  • Figure 2: Pipeline of Kinematify for recovering articulated robots from a single RGB image. Step 1: A 3D foundation model generates a segmented mesh of the robot. Step 2: A contact graph is constructed over mesh parts, capturing candidate relations between components. Step 3: The kinematic tree is inferred using MCTS, which resolves ambiguous connections by leveraging structural priors such as hierarchy and symmetry. Step 4: Joint parameters are refined using the DW-CAVL optimization approach while preserving near-contact geometry. Bottom row: Examples of inferred revolute joints with optimized axes.
  • Figure 3: Examples of articulated objects generated by Kinematify. Each row shows different objects across a sequence of joint configurations.
  • Figure 4: Qualitative comparison of articulation recovery on everyday objects across three methods: Kinematify (ours), Articulate Anymesh, and ArtGS. The red line indicates the joint direction.
  • Figure 5: Demonstration of Kinematify on two high-DoF robots: Unitree Go2 (12 DoF, left) and Unitree H1 (19 DoF, right). For each case, the pipeline starts from a segmented mesh, followed by kinematic tree inference and joint parameter optimization.
  • ...and 1 more figure
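
Steps 2 and 3 of the Figure 2 caption (contact graph, then kinematic tree) can be illustrated with a small sketch. Everything here is an assumption made for illustration: parts are approximated by axis-aligned bounding boxes, contact is declared when two boxes lie within a tolerance `tol`, and a breadth-first traversal from a chosen root stands in for the reward-guided MCTS described in the captions. The function names and the three-part arm are hypothetical, and the sketch does not model the structure, stability, symmetry, or hierarchy rewards.

```python
from collections import deque
from itertools import combinations


def aabb_gap(a, b):
    """Largest per-axis separation between two boxes.

    Each box is (min_xyz, max_xyz). The result is non-positive when the
    boxes overlap or touch on every axis.
    """
    (amin, amax), (bmin, bmax) = a, b
    return max(max(bmin[i] - amax[i], amin[i] - bmax[i]) for i in range(3))


def contact_graph(parts, tol=1e-3):
    """Build an undirected contact graph: part name -> set of neighbours."""
    graph = {name: set() for name in parts}
    for u, v in combinations(parts, 2):
        if aabb_gap(parts[u], parts[v]) <= tol:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def spanning_tree(graph, root):
    """Breadth-first spanning tree as a child -> parent mapping.

    This is a placeholder for the search step: it picks the first valid
    parent it finds instead of scoring candidate trees with a reward.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph[node]):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent


# Illustrative three-part arm: a base with two stacked links (made-up sizes).
parts = {
    "base": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    "upper_arm": ((0.4, 0.4, 1.0), (0.6, 0.6, 2.0)),
    "forearm": ((0.4, 0.4, 2.0), (0.6, 0.6, 3.0)),
}
graph = contact_graph(parts)
tree = spanning_tree(graph, "base")
```

On this input the forearm touches only the upper arm, so the tree is a simple chain rooted at the base. When a part touches several others, the contact graph has cycles and more than one spanning tree is possible, which is the ambiguity the captions say the MCTS stage resolves.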