iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos

Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva

TL;DR

This work tackles the practical problem of building interactable digital twins for articulated objects from casually captured RGBD videos. It introduces iTACO, a coarse-to-fine pipeline that first estimates rough joint parameters and movable-part segmentation and then refines these estimates through gradient-based optimization against a surface-point-cloud representation, guided by a moving-map and automatic part segmentation. A large synthetic dataset of 284 objects across 11 categories (784 videos) plus real RGBD sequences demonstrates that iTACO outperforms state-of-the-art baselines on both articulation parameter estimation and geometric reconstruction. The approach is designed to be general, not reliant on external libraries or fine-tuning, and offers a scalable path toward practical digital twins for robotics and embodied AI.

Abstract

Articulated objects are prevalent in daily life. Interactable digital twins of such objects have numerous applications in embodied AI and robotics. Unfortunately, current methods to digitize articulated real-world objects require carefully captured data, preventing practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. However, this setting is challenging due to simultaneous object and camera motion and significant occlusions as the person interacts with the object. To tackle these challenges, we introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a dataset of 784 videos containing 284 objects across 11 categories that is 20$\times$ larger than available in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated object digital twin methods on both synthetic and real casually captured RGBD videos.

Paper Structure

This paper contains 36 sections, 2 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: We propose iTACO: a coarse-to-fine framework for building interactable digital twins of articulated objects from a casually captured RGBD video. Our pipeline first predicts coarse joint parameters and movable part segmentations. The initial estimates are then refined with gradient-based optimization.
  • Figure 2: An overview of our coarse prediction pipeline. We first use feature matching in the static regions to estimate relative camera poses and align all observations to the same coordinate frame. Then, we compute the rigid transformation using feature matching in the dynamic regions to estimate joint parameters. Finally, we average all the results to produce a joint parameter estimate.
  • Figure 3: An overview of our refinement pipeline. We transform observations in the video back to the initial state with camera poses and joint parameters. We then compute the chamfer distance from the transformed observation to the object surface as a loss function and optimize relevant parameters.
  • Figure 4: Qualitative results on synthetic data. The moving parts of the object are shown in orange and static parts are in blue. Prismatic joints are green and revolute joints are red. To illustrate the joint state prediction results, we render the articulated object at the end state.
  • Figure 5: Qualitative results on real data. We find that Articulate Anything struggles with objects that do not exist in the mesh library, such as books. RSRD does not work well on textureless objects such as the cabinet example.
  • ...and 5 more figures
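
The refinement loss described in the Figure 3 caption can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single prismatic joint, a one-sided chamfer distance using squared Euclidean distances, and a brute-force nearest-neighbor search. The helper names `undo_prismatic` and `one_sided_chamfer` are hypothetical. It only evaluates the loss at two candidate joint states; the paper optimizes the parameters with gradient-based methods.

```python
import math

def undo_prismatic(points, axis, displacement):
    """Move points back along the (normalized) joint axis by `displacement`.

    Hypothetical helper: maps an observation of the moved part back to the
    initial state under an assumed prismatic joint.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    return [tuple(p[i] - displacement * unit[i] for i in range(3)) for p in points]

def one_sided_chamfer(source, target):
    """Mean squared distance from each source point to its nearest target point."""
    total = 0.0
    for s in source:
        total += min(sum((s[i] - t[i]) ** 2 for i in range(3)) for t in target)
    return total / len(source)

# Toy object surface: a flat drawer front sampled on a 5x5 grid at x = 0.
surface = [(0.0, y / 4, z / 4) for y in range(5) for z in range(5)]

# Toy observation: the same points after the drawer slid 0.3 along +x.
observed = [(x + 0.3, y, z) for (x, y, z) in surface]

axis = (1.0, 0.0, 0.0)  # assumed joint axis for this toy example

# Loss when the joint state matches the true displacement.
loss_at_truth = one_sided_chamfer(undo_prismatic(observed, axis, 0.3), surface)

# Loss when the joint state is wrongly assumed to be zero.
loss_at_zero = one_sided_chamfer(undo_prismatic(observed, axis, 0.0), surface)

print(loss_at_truth, loss_at_zero)
```

The loss is lowest when the assumed joint state brings the observation back onto the surface, which is the signal an optimizer would follow. A practical version would replace the brute-force search with a spatial index and include camera poses and revolute joints.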