
Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects

Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, Stan Birchfield

TL;DR

This work tackles digital-twin construction for unknown articulated objects using two RGB-D scans in different states. It introduces a two-stage approach: Stage 1 builds per-state neural implicit geometry via a neural object field to obtain meshes, while Stage 2 infers a multi-part articulation model through a probabilistic part segmentation and per-part rigid motions, supervised by a point correspondence field and losses that integrate 3D geometry, 2D image matches, and kinematics. The method supports arbitrary objects with multiple moving parts and no shape priors, showing improved robustness and accuracy over baselines such as Ditto and PARIS across synthetic and real data, including multi-part scenarios. This enables reliable digital twins for robotics and simulation, with implications for fast, category-agnostic articulation reconstruction from limited observations.

Abstract

We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects. Our method first reconstructs object-level shape at each state, then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images, 3D reconstructions, and kinematics, our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: https://github.com/NVlabs/DigitalTwinArt
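
The point-level correspondence mentioned in the abstract can be illustrated with a small NumPy sketch. It assumes each point has a soft assignment over K parts and that each part moves by a rigid transform, and it blends the per-part transformed positions by those probabilities. The blending rule, the function names (`forward_correspondence`, `rotation_z`), and the toy data are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np


def rotation_z(angle_rad):
    """Rotation matrix for a rotation about the z-axis (toy revolute joint)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def forward_correspondence(points, part_probs, rotations, translations):
    """Map points from one articulation state to the other.

    points:       (N, 3) points at the initial state
    part_probs:   (N, K) soft part assignment, each row sums to 1
    rotations:    (K, 3, 3) per-part rotation
    translations: (K, 3) per-part translation
    Returns (N, 3) probability-weighted positions at the final state.
    """
    # moved[k, n] = R_k @ x_n + t_k, i.e. every point under every part's motion
    moved = np.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Blend the K candidate positions of each point by its part probabilities
    return np.einsum("nk,kni->ni", part_probs, moved)


# Toy example (hypothetical data): part 0 is static, part 1 rotates 90 degrees about z.
points = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
part_probs = np.array([[1.0, 0.0], [0.0, 1.0]])
rotations = np.stack([np.eye(3), rotation_z(np.pi / 2)])
translations = np.zeros((2, 3))

matched = forward_correspondence(points, part_probs, rotations, translations)
```

Here the first point stays in place because it belongs to the static part, while the second point follows the rotating part. With soft (non one-hot) probabilities the same function returns a weighted average of the candidate positions, which is one simple way to keep the mapping differentiable with respect to the segmentation.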

Paper Structure (25 sections, 22 equations, 9 figures, 5 tables)


Figures (9)

  • Figure 1: Our method requires two RGB-D scans of the object in each of two articulation states (left). The output is a 3D reconstruction with parts segmented, joint types identified, and joint axes estimated (top right). Note that multiple joints are allowed. The resulting digital twin can be imported into a physics-based simulator for interaction (bottom right).
  • Figure 2: Overview of our method. In Stage 1, given multi-view RGB-D scans of the object at the initial and final articulation states, two neural object fields are optimized, one for each state. Upon convergence, the meshes corresponding to the two states are extracted. In Stage 2, the part segmentation field and per-part motions are optimized with three losses: consistency, matching, and collision. Together, the segmentation field and part motions yield point correspondences between the two states.
  • Figure 3: Motivation for the collision loss. (a) and (b) are the observations of the object at the initial and final states, respectively. Suppose the object is wrongly segmented as shown in (c), where blue represents the movable part. Moving the part with the forward motion results in (d). In this case, the wrong segmentation field still yields a low consistency loss for SDF and color. We therefore introduce an additional collision loss.
  • Figure 4: Illustration of the collision loss. We aim to detect and penalize collisions between parts after applying the predicted forward motion (moving the two sticks inwards). For a point $\mathbf{y}$ at state $t'$, we backtrace a set of points $\widetilde{\operatorname{Bwd}}(\mathbf{y})$ ($\{a, b, c\}$) that may move to $\mathbf{y}$, by transforming $\mathbf{y}$ with each part's inverse motion (moving outwards following the arrows). We then check whether the candidate point $\mathbf{x}_i$ obtained with part $i$'s motion is indeed a point in part $i$, by looking up its occupancy and part label. Finally, we obtain the set of points $\overleftarrow{\operatorname{Bwd}}(\mathbf{y})$ ($\{b, c\}$) that in fact map to $\mathbf{y}$ under the articulation model, and report a collision if more than one point maps to $\mathbf{y}$, i.e., $|\overleftarrow{\operatorname{Bwd}}(\mathbf{y})| > 1$.
  • Figure 5: Qualitative results of shape reconstruction, part segmentation, and joint prediction on the PARIS dataset [liu2023paris]. The top two rows correspond to synthetic data. The bottom row corresponds to real-world data. While PARIS and PARIS* occasionally work for these objects, depending upon the random seed, they often fail. Shown are the results from a typical trial that achieves near-average results.
  • ...and 4 more figures
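
The backtracing check described in the Figure 4 caption can be sketched as follows. This is a hard, non-differentiable version written only to make the logic concrete: the caption describes a loss, which would need a soft formulation, and the lookup callables `is_occupied` and `part_label` as well as the toy two-stick scene are hypothetical stand-ins for the occupancy and part-label queries.

```python
import numpy as np


def backtrace_sources(y, rotations, translations, is_occupied, part_label):
    """Return the (part index, source point) pairs that map to y.

    For each part i, invert its rigid motion x -> R_i x + t_i to get the
    candidate x_i, then keep it only if x_i is occupied and labeled as part i.
    """
    sources = []
    for i, (R, t) in enumerate(zip(rotations, translations)):
        x_i = R.T @ (y - t)  # inverse of a rigid motion (R is orthonormal)
        if is_occupied(x_i) and part_label(x_i) == i:
            sources.append((i, x_i))
    return sources


def has_collision(y, rotations, translations, is_occupied, part_label):
    """A collision is reported when more than one source point maps to y."""
    return len(backtrace_sources(y, rotations, translations,
                                 is_occupied, part_label)) > 1


# Toy scene (hypothetical): two sticks on the x-axis, part 0 on the left
# (x in [-2, -1]) and part 1 on the right (x in [1, 2]).
def is_occupied(x):
    return 1.0 <= abs(x[0]) <= 2.0 and abs(x[1]) <= 0.1 and abs(x[2]) <= 0.1


def part_label(x):
    return 0 if x[0] < 0 else 1


identity = [np.eye(3), np.eye(3)]
# Moving each stick inwards by 1.5 makes them overlap around the origin.
large_motion = [np.array([1.5, 0.0, 0.0]), np.array([-1.5, 0.0, 0.0])]
# Moving each stick inwards by 0.5 leaves a gap between them.
small_motion = [np.array([0.5, 0.0, 0.0]), np.array([-0.5, 0.0, 0.0])]

overlap = has_collision(np.zeros(3), identity, large_motion, is_occupied, part_label)
no_overlap = has_collision(np.array([-1.0, 0.0, 0.0]), identity, small_motion,
                           is_occupied, part_label)
```

In the first query both sticks reach the origin after the motion, so two source points map to it and a collision is flagged. In the second query only the left stick reaches the queried point, so no collision is reported.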