CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, Thomas Kollar

TL;DR

CARTO addresses the problem of reconstructing multiple unknown articulated objects from a single stereo image by learning a category- and joint-agnostic latent space for shape ($z_{s}$) and articulation ($z_{j}$) and a two-part decoder (geometry via DeepSDF and a joint state predictor). It combines this with a stereo RGB encoder to perform single-shot detection, pose estimation, and shape/articulation reconstruction for all detected objects, without category-specific decoders. The approach demonstrates competitive canonical-reconstruction performance and substantial improvements over two-stage baselines in full-pipeline experiments, including a roughly $20.4\%$ absolute gain in $mAP_{3DIOU50}$ on novel instances, while running at about $1$ Hz on a GPU for up to eight objects. Despite training on synthetic data, CARTO shows transfer to real-world scenes and provides a practical, fast pathway for real-time articulated object understanding in robotics and AR/VR contexts; limitations include reliance on a learned shape prior and current restriction to a single joint, with future work extending to arbitrary-graph kinematics and test-time adaptation.

Abstract

We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves reconstruction accuracy comparable to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference is fast, running at 1 Hz on an NVIDIA TITAN XP GPU when eight or fewer objects are present. While trained only on simulated data, CARTO transfers to real-world object instances. Code and evaluation data are available at: http://carto.cs.uni-freiburg.de

Paper Structure (27 sections, 19 equations, 11 figures, 8 tables)

This paper contains 27 sections, 19 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Visualization of CARTO on unseen object instances. We first use CARTO to jointly detect all objects in the scene and then articulate them while keeping the predicted shape code constant.
  • Figure 2: Overview of our proposed method. We first encode a stereo image using laskey2021simnet to predict a depth and an importance value, a pose, and a shape and joint code for each pixel. Peak detection on the depth map then allows us to detect objects, which can be reconstructed from their latent codes. Last, to place the objects in the camera frame, we transform the reconstructed point cloud using the poses predicted at the peaks. On the right side we show the position of the predicted shape codes in a t-SNE visualization of the learned training shape codes, which are used as input to our single category- and joint-agnostic decoder. We additionally project each category's mean shape code and use these codes to reconstruct the objects at the average prismatic and revolute joint states in the training set.
  • Figure 3: Intuition for Latent Space Regularization. Our main idea is that the joint codes of two similarly articulated objects should be close. We define the similarity first through the joint type $\textit{jt}$ and second through an exponential distance measure of the joint state $q$. Here, the laptop (a) and the oven (b) both have a revolute joint and are opened to a similar angle of around $30^\circ$. In comparison, the table (d) has a prismatic joint and thus should not be close to the revolute instances. In contrast, the dishwasher (c) has a revolute joint but is opened much more than the other revolute objects, and therefore should be relatively close. The visualization shows a lower-dimensional projection of our learned latent joint space, trained using our regularization.
  • Figure S.1: Encoder Architecture based on laskey2021simnet
  • Figure S.2: Decoder Architecture. The numbers indicate the size of the respective feature vector. Each arrow represents a layer of a multi-layer perceptron. For the geometry decoder, except for the first layer, the input to a layer always has a size of 512. The output dimensions vary depending on auxiliary inputs.
  • ...and 6 more figures
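
The similarity described in the Figure 3 caption (joint type first, then an exponential distance on the joint state $q$) can be illustrated with a minimal sketch. The exact formula is not given in this summary, so the squared-exponential form, the function name `joint_similarity`, the scale parameter `sigma`, and the example joint states below are all assumptions made for illustration rather than the authors' definition.

```python
import math


def joint_similarity(jt_a, q_a, jt_b, q_b, sigma=1.0):
    """Hypothetical similarity in [0, 1] between two articulation states.

    Assumption: objects with different joint types are treated as fully
    dissimilar; for equal joint types the similarity decays exponentially
    with the squared, sigma-scaled difference of the joint states.
    """
    if jt_a != jt_b:
        return 0.0
    return math.exp(-((q_a - q_b) / sigma) ** 2)


# Illustrative joint states only (radians for revolute, metres for prismatic).
laptop = ("revolute", math.radians(30))
oven = ("revolute", math.radians(30))
dishwasher = ("revolute", math.radians(80))
table = ("prismatic", 0.2)

print("laptop vs oven:      ", joint_similarity(*laptop, *oven))
print("laptop vs dishwasher:", joint_similarity(*laptop, *dishwasher))
print("laptop vs table:     ", joint_similarity(*laptop, *table))
```

Under these assumptions, the laptop and oven are maximally similar, the dishwasher receives an intermediate score, and the prismatic table scores zero against every revolute instance. A regularizer could then encourage distances between joint codes to reflect such scores.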