Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

Chuanrui Zhang, Yonggen Ling, Minglei Lu, Minghan Qin, Haoqian Wang

TL;DR

CODERS addresses category-level 3D object understanding under diverse surface properties by leveraging stereo imagery to resolve depth scale ambiguity. The method combines an Implicit Stereo Matching module with a Transformer decoder to jointly predict object category, 6D pose, and 3D shape in a single end-to-end pipeline, using a category-level SDF-based shape representation. It achieves state-of-the-art results on the TOD dataset and demonstrates strong generalization to unseen category-level instances in real-world robot manipulation, aided by the SS3D synthetic dataset and contrastive shape embeddings. The work highlights the potential of stereo-based multi-task perception for manipulation and provides datasets, code, and demos to foster further research.

Abstract

We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images. The base of our pipeline is an implicit stereo matching module that combines stereo image features with 3D position information. Combining this module with a subsequent transformer decoder architecture enables end-to-end learning of the multiple tasks required by robot manipulation. Our approach significantly outperforms all competing methods on the public TOD dataset. Furthermore, trained on simulated data, CODERS generalizes well to unseen category-level object instances in real-world robot manipulation experiments. Our dataset, code, and demos will be available on our project page.
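
The scale ambiguity mentioned in the abstract can be illustrated with textbook pinhole and rectified-stereo geometry. The symbols f (focal length), B (baseline), and d (disparity) are assumptions introduced here and do not appear in the source. The abstract describes an implicit stereo matching module, so the explicit formula below only illustrates the geometric intuition and is not the paper's computation.

```latex
% Pinhole projection of a 3D point (X, Y, Z) with focal length f:
u = f \frac{X}{Z}, \qquad v = f \frac{Y}{Z}

% Scaling the scene by any s > 0 leaves the image unchanged,
% so a single view cannot recover metric scale:
f \frac{sX}{sZ} = f \frac{X}{Z}

% Rectified stereo with a known baseline B: the disparity d = u_L - u_R
% between the left and right views determines metric depth:
d = \frac{fB}{Z} \quad \Longrightarrow \quad Z = \frac{fB}{d}
```

Because B is known from calibration, the depth Z is fixed in metric units without relying on a depth sensor, which tends to fail on the specular and transparent surfaces the abstract lists.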

Paper Structure (24 sections, 8 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Estimations of CODERS on Unseen Objects in Real-world Scenarios. (a) The left view of the input stereo images with various object surface properties (diffuse, specular, transparent and mixed); (b) Estimated object categories, 6D poses, and sizes; (c) Estimated object shapes; (d) The back-projection of the reconstructed shapes onto the left input image.
  • Figure 2: Visualization of RGBD Measurements in Real-world Experiments. (a) The left view of stereo images. (b) The side view of the obtained colored point cloud. The figure displays the RGB and depth maps of transparent objects, such as the cup and bottle, represented by blue rectangles. Polished plastic objects exhibiting high reflection, the bowl and the mug, are indicated by blue rectangles. The black rectangles represent steel objects, such as the knife, that are susceptible to specular reflection. Depth measurements of all these objects are both incomplete and inaccurate, which limits the performance of RGBD methods. Zoom-in is recommended.
  • Figure 3: Overview of Our Proposed CODERS. We present a single-stage network capable of processing multiple unknown objects, outputting detections, classes, 6D poses and 3D shapes concurrently. Using stereo images as input, our network generates stereo-aware features for easier alignment in implicit feature space. During the transformer decoder stage, object queries interact with 3D stereo-aware features, yielding object embeddings. These object embeddings are used to infer the category, pose and shape of objects using corresponding modules, which serve as the final output. In the Implicit Stereo Matching module, CT denotes coordinate transformer.
  • Figure 4: Illustration of Stereo Position Encoding Function. The stereo features are first dimension-aligned with the implicit feature space. Simultaneously, the global 3D coordinates are transformed into stereo 3D position embeddings using a coordinate encoder (MLP network). These stereo 3D position embeddings are then fused with the aligned stereo features to generate stereo-aware features.
  • Figure 5: Visualization of StereoPose and CODERS on TOD Dataset. (a) Left view image. (b) Right view image. (c) Results of StereoPose. (d) Results of CODERS. (e) Ground truths. Our method surpasses StereoPose in predicting location, size, and rotation.
  • ...and 4 more figures
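
The stereo position encoding described in the Figure 4 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, layer sizes, random weights, and the choice of element-wise addition as the fusion step are all assumptions, since the caption only says that the position embeddings are "fused" with the aligned features.

```python
import numpy as np


def init_params(c_in, hidden, d, seed=0):
    """Random weights for the sketch; all names and sizes are hypothetical."""
    rng = np.random.default_rng(seed)
    return {
        # Linear layer that aligns stereo features to the implicit feature dimension d.
        "w_align": rng.normal(size=(c_in, d)) * 0.1,
        "b_align": np.zeros(d),
        # Two-layer MLP acting as the coordinate encoder (3D coordinates -> d).
        "w1": rng.normal(size=(3, hidden)) * 0.1,
        "b1": np.zeros(hidden),
        "w2": rng.normal(size=(hidden, d)) * 0.1,
        "b2": np.zeros(d),
    }


def stereo_position_encoding(feats, coords, p):
    """Fuse stereo features with embeddings of their 3D coordinates.

    feats:  (N, c_in) stereo image features, one row per sampled location.
    coords: (N, 3) global 3D coordinates of the same locations.
    Returns (N, d) stereo-aware features.
    """
    # Step 1: dimension-align the stereo features.
    aligned = feats @ p["w_align"] + p["b_align"]
    # Step 2: encode the 3D coordinates with an MLP (ReLU hidden layer).
    hidden = np.maximum(coords @ p["w1"] + p["b1"], 0.0)
    pos_embed = hidden @ p["w2"] + p["b2"]
    # Step 3: fuse. Element-wise addition is an assumption made for this sketch.
    return aligned + pos_embed


# Usage: 5 locations with 8-dimensional features, encoded into 32 dimensions.
params = init_params(c_in=8, hidden=16, d=32)
features = np.ones((5, 8))
coordinates = np.linspace(0.0, 1.0, 15).reshape(5, 3)
fused = stereo_position_encoding(features, coordinates, params)
print(fused.shape)  # (5, 32)
```

In this sketch the fused output keeps the feature dimension of the aligned features, so it could be passed to a decoder unchanged. Concatenation followed by a projection would be another plausible fusion choice.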