NeRF-Feat: 6D Object Pose Estimation using Feature Rendering

Shishir Reddy Vutukur, Heike Brock, Benjamin Busam, Tolga Birdal, Andreas Hutter, Slobodan Ilic

TL;DR

This work uses a NeRF to learn object shape implicitly. The NeRF is then used together with a CNN to learn view-invariant features under a contrastive loss, and these features are used to estimate the pose in the NeRF's reference frame.

Abstract

Object pose estimation is a crucial component in robotic grasping and augmented reality. Learning-based approaches typically require training data from a highly accurate CAD model or labeled training data acquired using a complex setup. We address this by learning to estimate pose from weakly labeled data without a known CAD model. We propose to use a NeRF to learn object shape implicitly, which is later used together with a CNN to learn view-invariant features using a contrastive loss. While the NeRF helps in learning features that are view-consistent, the CNN ensures that the learned features respect symmetry. During inference, the CNN predicts view-invariant features, which are used to establish correspondences with the implicit 3D model in the NeRF. The correspondences are then used to estimate the pose in the reference frame of the NeRF. Unlike other approaches that use a similar training setup, our approach can also handle symmetric objects. Specifically, we learn viewpoint-invariant, discriminative features using the NeRF, which are later used for pose estimation. We evaluated our approach on the LM, LM-Occlusion, and T-Less datasets and achieved benchmark accuracy despite using weakly labeled data.
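
The inference step described in the abstract, matching CNN features to the implicit 3D model, can be illustrated with a minimal sketch. It assumes that each segmented pixel has a CNN feature vector, that a set of sampled 3D object points has NeRF feature vectors, and that the best match is the point with the highest cosine similarity. The function names, the data layout, and the choice of cosine similarity are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)


def match_pixels_to_points(pixel_feats, point_feats, points_3d):
    """Return a list of (pixel_xy, point_xyz) correspondences.

    pixel_feats: dict mapping (u, v) pixel coordinates to a CNN feature.
    point_feats: list of feature vectors, one per sampled 3D point.
    points_3d:   list of (x, y, z) tuples aligned with point_feats.
    All three layouts are assumptions made for this sketch.
    """
    matches = []
    for pixel_xy, feat in pixel_feats.items():
        # Pick the 3D point whose feature is most similar to the pixel's.
        best = max(range(len(point_feats)),
                   key=lambda i: cosine(feat, point_feats[i]))
        matches.append((pixel_xy, points_3d[best]))
    return matches
```

The resulting 2D-3D pairs would then be passed to a pose solver such as PnP with RANSAC, which yields the pose in the NeRF's reference frame.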

Paper Structure (33 sections, 11 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Architecture: Our architecture comprises a Ray Generator, a NeRF Block, and a U-Net CNN. The Ray Generator generates rays from a specific viewpoint, which are passed to the NeRF Block as discrete 3D points, $x$, along the ray, together with the ray direction, $r_{d}$. The NeRF Block comprises three MLPs. The Density MLP takes in a 3D coordinate, $x$, and predicts the corresponding density, $d$. The Color MLP takes in the intermediate feature from the Density MLP and the ray direction, $r_{d}$, as input to predict the color, $c$, at the point. The ray direction is added to the Color MLP to model view-dependent color changes. The Feature MLP takes in a 3D point, $x$, to predict a feature vector, $f$. Ray Integration accumulates densities, $d$, and color values, $c$, along a ray to get the final color, $C$. Similarly, density and feature values along a ray are accumulated to generate the feature value, $G$, and the silhouette value, $S$. Each ray corresponds to a pixel in an image. By generating rays for all the pixels, we can render our final image. Our CNN takes in the input image, $C^{\prime}$, corresponding to the same viewpoint and predicts the feature image, $F$. We formulate a contrastive feature loss between the feature images from the NeRF and the CNN. We train the orange blocks during stage 1 and freeze them during stage 2, when the blue blocks are trained.
  • Figure 2: Visualization of learned feature representation of symmetric T-Less objects along with meshes
  • Figure 3: Correspondence visualization for a continuous symmetric object in T-Less. The first three images show the input image, the estimated mask, and the segmented image. The next three images visualize the 2D-3D correspondences: lines connect points in the 2D segmented region to their matches in the 3D point cloud. The blue point cloud is the full point cloud of the object. Red points are the 3D points matched to 2D points. The different views show that the correspondences are biased towards one symmetric configuration. Ideally, for a continuous symmetric object, the correspondences would be distributed around the object. This bias towards one symmetric configuration makes inference faster, as we can use naive PnP-RANSAC to estimate the final pose instead of the intensive render-and-compare inference employed in SurfEmb to handle symmetric objects.
  • Figure 4: Visualization of mesh reconstructions of the Can object in LM. The figure shows the original mesh, our reconstruction from NeRF using marching cubes, and the SoftRas mesh, which is optimized using the SoftRas differentiable renderer. The SoftRas mesh cannot reconstruct holes because it is optimized from a genus-zero sphere mesh.
  • Figure 5: Visualization of correspondences for object 28 in T-Less, a discrete symmetric object. The visualization shows the 2D-3D correspondences between 2D masked pixels and the 3D point cloud of the object. Lines join matched 2D-3D correspondences. The segmented 2D image points are shown with their RGB color from the image. The blue point cloud is the full point cloud of the object. The red point cloud marks the 3D points matched to the current image (indicated by the masked pixels). We visualize the correspondences in different views to show where the 3D correspondences on the object are matched. The first row shows the correspondences for SurfEmb. The second row shows the correspondences for our approach. In SurfEmb, the correspondences are distributed around the symmetric object. In our approach, the correspondences are biased towards only one symmetric configuration.
  • ...and 5 more figures
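
The Ray Integration step in the Figure 1 caption can be sketched for a single ray as follows. The caption does not give the accumulation formula, so this sketch assumes the standard NeRF quadrature: each sample's weight is its opacity multiplied by the transmittance accumulated before it, and the same weights are reused for the color $C$, the feature $G$, and the silhouette $S$. The function name, the argument layout, and the `deltas` spacing parameter are hypothetical.

```python
import math


def integrate_ray(densities, deltas, colors, features):
    """Accumulate per-sample values along one ray.

    densities: density d at each sample along the ray.
    deltas:    distance between consecutive samples (assumed input).
    colors:    per-sample color c, e.g. [r, g, b].
    features:  per-sample feature vector f.
    Returns (C, G, S): rendered color, rendered feature, silhouette.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    C = [0.0] * len(colors[0])
    G = [0.0] * len(features[0])
    S = 0.0
    for d, delta, c, f in zip(densities, deltas, colors, features):
        alpha = 1.0 - math.exp(-d * delta)  # opacity of this sample
        weight = transmittance * alpha
        C = [acc + weight * v for acc, v in zip(C, c)]
        G = [acc + weight * v for acc, v in zip(G, f)]
        S += weight  # accumulated opacity, between 0 and 1
        transmittance *= 1.0 - alpha
    return C, G, S
```

Under these assumptions, a ray that passes only through empty space gives $S = 0$, and a ray that hits an opaque surface gives $S$ close to 1 with $C$ and $G$ dominated by the first opaque sample.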