sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

Arslan Artykov, Corentin Sautier, Vincent Lepetit

TL;DR

The paper introduces sim2art, a data-driven method for recovering joint parameters and part segmentation of articulated objects from a single monocular video captured with a moving camera, trained exclusively on synthetic data. It employs a Transformer-based architecture that processes temporally aligned point clouds with per-frame scene flow and DINOv3 semantic features, using Hungarian matching to handle variable numbers of parts. The approach generalizes from synthetic sequences to real-world objects and achieves state-of-the-art performance on both synthetic and real datasets while remaining robust to 4D reconstruction artifacts. By enabling scalable synthetic-data training and direct monocular video inference, the method offers practical benefits for robotics, digital twins, and dynamic environment understanding.
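To illustrate how a one-to-one matching step can cope with a variable number of parts, here is a minimal sketch. It is not the authors' implementation. The cost (1 - IoU over sets of point indices), the fixed number of prediction slots, and the names `iou` and `match_parts` are all assumptions made for illustration. For brevity it finds the optimal assignment by brute force rather than with the Hungarian algorithm itself; both return a minimum-cost assignment.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_parts(pred_parts, gt_parts):
    """Assign each ground-truth part to a distinct predicted slot.

    Minimizes the total (1 - IoU) cost. Assumes (hypothetically) that the
    model emits at least as many slots as there are ground-truth parts;
    unmatched slots would be treated as "no part".
    Brute force over permutations, so only suitable for a handful of parts.
    """
    n_pred, n_gt = len(pred_parts), len(gt_parts)
    best_cost, best_slots = float("inf"), ()
    for slots in permutations(range(n_pred), n_gt):
        cost = sum(
            1.0 - iou(pred_parts[s], gt_parts[g]) for g, s in enumerate(slots)
        )
        if cost < best_cost:
            best_cost, best_slots = cost, slots
    # Map ground-truth part index -> predicted slot index.
    return {g: s for g, s in enumerate(best_slots)}, best_cost


# Three predicted slots, two ground-truth parts (sets of point indices).
pred = [{0, 1, 2}, {7, 8}, {3, 4, 5, 6}]
gt = [{3, 4, 5, 6}, {0, 1}]
assignment, cost = match_parts(pred, gt)
```

Here ground-truth part 0 is matched to slot 2 and part 1 to slot 0, leaving slot 1 unmatched. For more than a few parts, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.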

Abstract

Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/

Paper Structure

This paper contains 25 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: We introduce a method that robustly and accurately recovers the joint parameters, part segmentation, and motion magnitudes of an articulated object from a single video, while being trained entirely on synthetic data.
  • Figure 2: Method Pipeline. Our method takes as input a sequence of images, from which we obtain the object masks, the depth maps, and the camera parameters. We sample 2D points over the masks, lift them to 3D, and augment them with their scene flow and DINOv3 features. From this input, we predict the part segmentation, the joint parameters for each part, and the amount of motion for each part at each time step.
  • Figure 3: Extracted points in two images. The points do not correspond to the same physical points in general.
  • Figure 4: Representative instances from all the categories used in our dataset.
  • Figure 5: Qualitative results on samples from the synthetic dataset. Our method retrieves the joint parameters and segments the object parts accurately and robustly. Articulate-Anything cannot handle multi-part object sequences; these cases are marked with $\times$.
  • ...and 1 more figure
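
The "lift them to 3D" step in the Figure 2 caption can be sketched with a standard pinhole camera model. This is an illustrative assumption, not the paper's code: the function names, the nested-list depth map and mask, and the intrinsics tuple `(fx, fy, cx, cy)` are hypothetical. The points are expressed in the camera frame; moving them to a common world frame would additionally need the per-frame camera pose.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to a 3D point
    in the camera frame, using the pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_masked_pixels(pixels, depth_map, mask, intrinsics):
    """Back-project the sampled pixels that lie on the object mask.

    pixels:     iterable of (u, v) integer pixel coordinates (column, row)
    depth_map:  depth_map[v][u] is the depth at that pixel
    mask:       mask[v][u] is truthy where the object is visible
    intrinsics: (fx, fy, cx, cy), assumed known or estimated upstream
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for u, v in pixels:
        d = depth_map[v][u]
        if mask[v][u] and d > 0:  # skip background and invalid depth
            points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points


# Tiny 2x2 example: only the top-left and bottom-right pixels are on the object.
depth_map = [[2.0, 2.0], [0.0, 4.0]]
mask = [[1, 0], [1, 1]]
intrinsics = (1.0, 1.0, 0.0, 0.0)
points = lift_masked_pixels([(0, 0), (1, 0), (0, 1), (1, 1)], depth_map, mask, intrinsics)
```

In this toy case pixel (1, 0) is dropped because it is off the mask and pixel (0, 1) because its depth is invalid, so two 3D points remain.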