NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models

Francesco Milano, Jen Jen Chung, Hermann Blum, Roland Siegwart, Lionel Ott

TL;DR

The paper tackles 6D object pose estimation without CAD models or PBR synthetic data by introducing NeuSurfEmb, which builds a NeuS2-based object representation from a small set of real images and trains SurfEmb-based dense 2D-3D correspondences using NeuS2 renderings and cut-and-paste augmentation. It combines an object model built with SfM and segmentation with a neural implicit surface, enabling photorealistic data generation without CAD models. The method achieves accuracy competitive with CAD-based methods on LINEMOD-Occlusion and, on real-world objects, outperforms prior CAD-model-free approaches in both accuracy and robustness to mild occlusions. The authors provide an open-source pipeline to facilitate adoption in robotics and related fields.

Abstract

State-of-the-art approaches for 6D object pose estimation assume the availability of CAD models and require the user to manually set up physically-based rendering (PBR) pipelines for synthetic training data generation. Both factors limit the application of these methods in real-world scenarios. In this work, we present a pipeline that does not require CAD models and allows training a state-of-the-art pose estimator using only a small set of real images as input. Our method is based on a NeuS2 object representation, which we learn through a semi-automated procedure based on Structure-from-Motion (SfM) and object-agnostic segmentation. We exploit the novel-view synthesis ability of NeuS2 and simple cut-and-paste augmentation to automatically generate photorealistic object renderings, which we use to train the correspondence-based SurfEmb pose estimator. We evaluate our method on the LINEMOD-Occlusion dataset, extensively studying the impact of its individual components and showing competitive performance with respect to approaches based on CAD models and PBR data. We additionally demonstrate the ease of use and effectiveness of our pipeline on self-collected real-world objects, showing that our method outperforms state-of-the-art CAD-model-free approaches, with better accuracy and robustness to mild occlusions. To allow the robotics community to benefit from this system, we will publicly release it at https://www.github.com/ethz-asl/neusurfemb.
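
To make the cut-and-paste augmentation mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes images are float arrays in [0, 1], that the object mask comes with the rendering, and that the occluder patch fits entirely inside the image; the function names `cut_and_paste` and `paste_occluder` are hypothetical.

```python
import numpy as np


def cut_and_paste(rendering, mask, background):
    """Composite a rendered object onto a background image.

    rendering:  (H, W, 3) float array, object rendering.
    mask:       (H, W) float array, 1 where the object is visible.
    background: (H, W, 3) float array, replacement background.
    """
    alpha = mask[..., None].astype(np.float32)
    return alpha * rendering + (1.0 - alpha) * background


def paste_occluder(image, mask, occluder, occluder_mask, top, left):
    """Paste an occluder patch at (top, left) to simulate occlusion.

    Assumes the patch lies fully inside the image (illustrative only).
    Returns the new image and the object mask with occluded pixels removed.
    """
    out = image.copy()
    visible = mask.copy()
    h, w = occluder_mask.shape
    region = (slice(top, top + h), slice(left, left + w))
    a = occluder_mask[..., None]
    out[region] = a * occluder + (1.0 - a) * out[region]
    # Pixels covered by the occluder no longer belong to the visible object.
    visible[region] = visible[region] * (1.0 - occluder_mask)
    return out, visible
```

In a training loop, one would typically draw the background, the occluder, and its position at random for each sample, so that the same rendering is seen under many background and occlusion conditions.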

Paper Structure (16 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Overview of the proposed method. Starting from a set of reference images $\{\mathbf{I}_i\}$ around the object of interest, and using Structure-from-Motion and a pipeline based on Segment Anything [Kirillov2023SAM] and the object tracker MixFormer [Cui2022MixFormer] to estimate corresponding camera poses $\{\mathbf{P}_i\}$ and object masks $\{\tilde{\mathbf{I}}_i\}$, we construct an object model and synthesized dataset by training a NeuS2 [Wang2023NeuS2] model and generating renderings from novel views $\{\mathbf{P}^\textrm{syn}_i\}$ (yellow boxes). We use the generated object model and synthesized dataset, augmented online using cut-and-paste [Dwibedi2017CutPasteLearn] to simulate occlusions and background variations, to learn feature-based dense 2D-3D correspondences based on SurfEmb [Haugaard2022SurfEmb] (green box). We then estimate the object pose in a test image by sampling correspondences based on the learned object features and the predicted image features and using PnP with RANSAC and pose refinement (purple box).
  • Figure 2: Example NeuS2 reconstructions (shown as textured point clouds) for the objects from LINEMOD-Occlusion, overlaid on the point clouds sampled from the CAD models (shown in green). Next to the object names we report the forward Chamfer distance with respect to the CAD model.
  • Figure 3: Example images captured for model construction in the real-world experiments (top row) and corresponding NeuS2 reconstructions (bottom row). The objects depicted from left to right are: $\mathrm{bluebox}$, $\mathrm{extinguisher}$, $\mathrm{greybox}$, $\mathrm{helmet}$, $\mathrm{kettle}$.
  • Figure 4: Example visualizations of the poses estimated by the different methods in the real-world experiments, displayed as rendered coordinates and reprojected object bounding boxes overlaid on the original image. The scenes depicted from left to right are: $\mathrm{extinguisher}$, $\mathrm{greybox}$, $\mathrm{kettle}$, $\mathrm{bluebox} - \mathrm{helmet}$, $\mathrm{greybox} - \mathrm{kettle}$, $\mathrm{helmet} - \mathrm{extinguisher}$ (cf. the tables of real-world results without occlusion and with occlusions).
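
The caption of Figure 2 reports a forward Chamfer distance between each NeuS2 reconstruction and the CAD model. The exact definition used in the paper is not given in this summary. A common one-directional form, assumed in the sketch below, is the mean distance from each reconstructed point to its nearest point sampled from the CAD model. The function name `forward_chamfer` and the brute-force nearest-neighbor search are illustrative choices.

```python
import numpy as np


def forward_chamfer(source, target):
    """Mean nearest-neighbor distance from `source` points to `target` points.

    source: (N, 3) array, e.g. points from the reconstruction (assumed direction).
    target: (M, 3) array, e.g. points sampled from the CAD model.
    Uses an O(N * M) pairwise distance matrix, fine for small point clouds.
    """
    diffs = source[:, None, :] - target[None, :, :]   # (N, M, 3)
    dists = np.linalg.norm(diffs, axis=-1)            # (N, M)
    return dists.min(axis=1).mean()
```

For large point clouds, a spatial index such as a k-d tree would replace the dense distance matrix. Note that the measure is asymmetric: swapping `source` and `target` generally gives a different value.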