Object Pose Estimation Using Implicit Representation For Transparent Objects

Varun Burde, Artem Moroz, Vit Zeman, Pavel Burget

TL;DR

The paper shows that representing an object implicitly as a Neural Radiance Field (NeRF) yields a more realistic rendering of the actual scene and retains the crucial spatial features, which makes the comparison step of render-and-compare pose estimation more versatile.

Abstract

Object pose estimation is a prominent task in computer vision. The object pose gives the orientation and translation of the object in real-world space, which enables applications such as manipulation and augmented reality. Objects interact with light in different ways, for example through reflection and absorption, which makes it challenging to recover an object's structure from the RGB and depth channels. Recent research has been moving toward learning-based methods, which use deep learning to provide a more flexible and generalizable approach to object pose estimation. One such approach is the render-and-compare method, which renders the object from multiple views and compares the renderings against the given 2D image; it often requires an object representation in the form of a CAD model. We reason that the synthetic texture of the CAD model may not be ideal for rendering and comparing operations. We show that if the object is instead represented implicitly as a Neural Radiance Field (NeRF), the rendering is more realistic with respect to the actual scene and retains the crucial spatial features, which makes the comparison more versatile. We evaluated our NeRF implementation of the render-and-compare method on datasets of transparent objects and found that it surpassed the current state-of-the-art results.

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The same object in three representations. Left: a CAD model without any texture. Middle: a rendering of the same object from a trained NeRF. Right: the CAD model in the Blender scene. The three differ markedly in visual appearance.
  • Figure 2: Pipeline of the render-and-compare method using NeRF as the implicit representation. 1) Camera poses and images rendered from the Blender scene, which are used to train the NeRF (model-free setup). 2) Input to the pipeline at inference: the method needs an RGB image with tight bounding-box annotations for each object. 3) The NeRF is optimized for the scene and post-processed to remove background noise. 4) Coarse pose estimation, performed as a classification task over the sampled rendered views. 5) The refiner block, which iteratively refines the pose by adding $\Delta t$ and $\Delta r$ to it. 6) The resulting pose, shown with the NeRF rendering overlaid on the exact pose of the object.
  • Figure 3: Visual appearance of various objects rendered from different views with the trained NeRF. Top row: the textured box. Middle row: transparent glass beakers. Bottom row: a metallic rod.
  • Figure 4: Dataset used to fine-tune the pose estimator and improve its performance on transparent objects. The objects are augmented with a transparency shader. Each scene has a random floor background and four light sources placed around the objects.
  • Figure 5: Example images from the four benchmarked datasets. a) A test image from the HouseCat6D dataset, which consists of textured, shiny, metallic, and matte objects of different categories. b) A test image from the ClearPose dataset, showing glass utensils next to each other in a tabletop setup. c) A test image from the TransPose dataset, showing glassy and plastic objects with different optical properties cluttered in a tabletop setup. d) A test image from the DIMO dataset, showing colored, shiny, and matte-finish metallic objects on a metallic surface.
  • ...and 1 more figure
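
Steps 4 and 5 of the Figure 2 pipeline (coarse estimation over sampled views, followed by iterative refinement with $\Delta t$ and $\Delta r$) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation. It assumes that $\Delta r$ is an axis-angle vector applied by left-multiplying the current rotation and that $\Delta t$ is added to the translation. The names `coarse_estimate`, `refine`, `toy_scorer`, and `toy_refiner` are hypothetical. The scorer and refiner here simply read a known target pose, whereas in the described pipeline those decisions would come from comparing NeRF renderings with the observed image.

```python
import numpy as np


def rodrigues(rvec):
    """Axis-angle vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def log_so3(R):
    """Rotation matrix -> axis-angle vector (valid for angles well below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w


def rotation_angle(R_a, R_b):
    """Geodesic angle (radians) between two rotations."""
    cos_theta = np.clip((np.trace(R_a @ R_b.T) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos_theta))


def coarse_estimate(candidates, score_fn):
    """Coarse stage: pick the sampled rotation with the highest score."""
    scores = [score_fn(R) for R in candidates]
    return candidates[int(np.argmax(scores))]


def refine(R, t, predict_update, n_iters=10):
    """Refinement stage: repeatedly apply the predicted (delta_r, delta_t)."""
    for _ in range(n_iters):
        delta_r, delta_t = predict_update(R, t)
        R = rodrigues(delta_r) @ R   # assumption: rotation composed on the left
        t = t + delta_t              # translation update is additive
    return R, t


# --- Toy setup: a known target pose stands in for the observed image. ---
R_target = rodrigues(np.array([0.1, -0.2, 0.9]))
t_target = np.array([0.05, -0.03, 0.80])


def toy_scorer(R):
    # Stand-in for the coarse classifier: higher score = closer to the target.
    return -rotation_angle(R_target, R)


def toy_refiner(R, t, step=0.5):
    # Stand-in for the refiner: returns half of the remaining pose error.
    return step * log_so3(R_target @ R.T), step * (t_target - t)


# Sampled views: eight rotations about the z-axis.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
candidates = [rodrigues(np.array([0.0, 0.0, a])) for a in angles]

R_coarse = coarse_estimate(candidates, toy_scorer)
t_init = t_target + np.array([0.05, -0.03, 0.08])
R_est, t_est = refine(R_coarse, t_init, toy_refiner)

print("coarse rotation error (rad):", rotation_angle(R_target, R_coarse))
print("refined rotation error (rad):", rotation_angle(R_target, R_est))
print("refined translation error:", np.linalg.norm(t_target - t_est))
```

Because the toy refiner removes half of the remaining error at each step, the pose error shrinks geometrically over the iterations. Splitting the work into a coarse selection followed by small iterative updates mirrors the two-stage structure in the figure: the coarse stage only needs to land close enough for the incremental updates to converge.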