iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos
Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
TL;DR
This work addresses the problem of building interactable digital twins of articulated objects from casually captured RGBD videos. It introduces iTACO, a coarse-to-fine pipeline that first estimates rough joint parameters and a movable-part segmentation, then refines both through gradient-based optimization against a surface point cloud representation, guided by a moving-map and automatic part segmentation. Experiments on a synthetic dataset of 784 videos covering 284 objects across 11 categories, together with real RGBD sequences, show that iTACO outperforms state-of-the-art baselines on both articulation parameter estimation and geometric reconstruction. The approach is designed to be general, does not rely on external libraries or fine-tuning, and offers a scalable path toward practical digital twins for robotics and embodied AI.
Abstract
Articulated objects are prevalent in daily life. Interactable digital twins of such objects have numerous applications in embodied AI and robotics. Unfortunately, current methods to digitize articulated real-world objects require carefully captured data, preventing practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. However, this setting is challenging due to simultaneous object and camera motion and significant occlusions as the person interacts with the object. To tackle these challenges, we introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a dataset of 784 videos containing 284 objects across 11 categories, which is 20$\times$ larger than those available in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated object digital twin methods on both synthetic and real casually captured RGBD videos.
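To make the coarse-to-fine idea concrete, the following is a minimal sketch of refining a rough revolute-joint angle by gradient descent on a point-cloud alignment loss. It is an illustration under strong simplifying assumptions, not the iTACO implementation: the joint axis and pivot are taken as already known, point correspondences between the two states are given, and only the angle is optimized. The function names `rotation_about_axis` and `refine_revolute_angle`, the learning rate, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def rotation_about_axis(axis, theta):
    """Rodrigues' formula: return (R, dR/dtheta) for rotation about `axis`."""
    k = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    dR = np.cos(theta) * K + np.sin(theta) * (K @ K)
    return R, dR

def refine_revolute_angle(src, dst, axis, pivot, theta_init, lr=0.5, steps=100):
    """Refine a coarse joint angle by gradient descent.

    Minimizes the mean squared distance between the rotated source points
    and their corresponding target points (correspondences are assumed).
    """
    theta = float(theta_init)
    centered = src - pivot
    for _ in range(steps):
        R, dR = rotation_about_axis(axis, theta)
        residual = centered @ R.T + pivot - dst          # (N, 3)
        # Analytic gradient of the mean squared residual w.r.t. theta.
        grad = 2.0 * np.mean(np.sum(residual * (centered @ dR.T), axis=1))
        theta -= lr * grad
    return theta

# Synthetic movable part: random points rotated by 0.8 rad about the z-axis.
rng = np.random.default_rng(0)
src = rng.uniform(-1.0, 1.0, size=(200, 3))
axis = np.array([0.0, 0.0, 1.0])
pivot = np.zeros(3)
true_theta = 0.8
R_true, _ = rotation_about_axis(axis, true_theta)
dst = (src - pivot) @ R_true.T + pivot

# Start from a deliberately rough (coarse) angle and refine it.
theta = refine_revolute_angle(src, dst, axis, pivot, theta_init=0.5)
print(f"refined angle: {theta:.6f} rad")
```

In a casually captured video, correspondences are not available and the axis and pivot are themselves uncertain, so a practical system would need a correspondence-free loss and would optimize all joint parameters jointly; the sketch only shows the refinement step in its simplest form.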
