CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, Thomas Kollar
TL;DR
CARTO addresses the problem of reconstructing multiple unknown articulated objects from a single stereo image by learning a category- and joint-agnostic latent space for shape ($z_{s}$) and articulation ($z_{j}$) and a two-part decoder (geometry via DeepSDF and a joint state predictor). It combines this with a stereo RGB encoder to perform single-shot detection, pose estimation, and shape/articulation reconstruction for all detected objects, without category-specific decoders. The approach demonstrates competitive canonical-reconstruction performance and substantial improvements over two-stage baselines in full-pipeline experiments, including a roughly $20.4\%$ absolute gain in $mAP_{3DIOU50}$ on novel instances, while running at about $1$ Hz on a GPU for up to eight objects. Despite training on synthetic data, CARTO shows transfer to real-world scenes and provides a practical, fast pathway for real-time articulated object understanding in robotics and AR/VR contexts; limitations include reliance on a learned shape prior and current restriction to a single joint, with future work extending to arbitrary-graph kinematics and test-time adaptation.
Abstract
We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves reconstruction accuracy comparable to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference is fast, running at 1 Hz on an NVIDIA TITAN XP GPU when eight or fewer objects are present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data are available at: http://carto.cs.uni-freiburg.de
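As an illustration of the IOU50 criterion behind the reported mAP figure, the following minimal sketch computes the 3D intersection-over-union of two boxes and checks it against a 0.5 threshold. It assumes axis-aligned boxes given as (min corner, max corner) pairs, which is a simplification made here for brevity; the function names and the box format are hypothetical and are not taken from the CARTO code base, whose evaluation may use oriented boxes and its own matching protocol.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as ((x0, y0, z0), (x1, y1, z1))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d_axis_aligned(a, b):
    """3D IoU of two axis-aligned boxes (assumed format: (min, max) corners)."""
    (a_min, a_max), (b_min, b_max) = a, b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        # Overlap length along this axis; non-positive means no intersection.
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        inter *= overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold


if __name__ == "__main__":
    gt = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
    pred = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))  # shifted by half the box width
    print(iou_3d_axis_aligned(pred, gt), is_true_positive(pred, gt))
```

In this example the predicted box overlaps half of the ground-truth box, which gives an IoU of 4 / 12 and therefore fails the 0.5 threshold. Mean average precision then aggregates such match decisions over confidence-ranked detections.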
