Fit-NGP: Fitting Object Models to Neural Graphics Primitives

Marwan Taher, Ignacio Alzugaray, Andrew J. Davison

TL;DR

Fit-NGP introduces a fully automatic, RGB-only pipeline for precise 6-DoF pose estimation of known 3D object models, using the density field produced by Instant-NGP as an intermediate representation of the scene. A multi-hypothesis optimisation aligns CAD or reconstructed object models to this density field: points sampled near the model surface and points offset along the surface normals define a differentiable fitness objective, and each pose hypothesis is refined with AdamW. The approach achieves millimetre-level translation accuracy and rotation errors of a few degrees on small, reflective objects within roughly two minutes, and it scales to multiple objects in a scene. The work shows that neural density fields are practical intermediates for high-precision robotic manipulation with a single RGB camera, offering robustness to challenging lighting and materials while remaining automatic and reproducible.
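
The alignment idea summarised above can be illustrated with a small toy problem. The sketch below is not the authors' implementation: it works in 2D with a 3-DoF pose (tx, ty, theta) instead of 6-DoF, replaces the Instant-NGP density field with an analytic smooth box (`density`), and uses finite-difference gradients in place of automatic differentiation. The exact form of `fitness`, the point offset, the hypothesis set, and every name in the code are assumptions made for illustration. The AdamW update is written out by hand with `weight_decay=0.0`, since decoupled decay would pull the pose towards zero in this toy.

```python
import math

TRUE_POSE = (0.5, -0.3, 0.4)  # hypothetical ground-truth (tx, ty, theta) of the toy object
HALF_W, HALF_H = 2.0, 1.0     # half-extents of the box-shaped object


def _sigmoid(a):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + math.tanh(0.5 * a))


def density(x, y, k=10.0):
    """Toy stand-in for a learned density field: ~1 inside the box, ~0 outside."""
    tx, ty, th = TRUE_POSE
    c, s = math.cos(th), math.sin(th)
    ox = c * (x - tx) + s * (y - ty)   # world -> object frame
    oy = -s * (x - tx) + c * (y - ty)
    return _sigmoid(k * (HALF_W - abs(ox))) * _sigmoid(k * (HALF_H - abs(oy)))


def model_points(n=8):
    """Boundary samples (px, py, nx, ny) of the box model with outward normals."""
    pts = []
    for i in range(n):
        t = -1.0 + (2.0 * i + 1.0) / n
        pts.append((t * HALF_W, HALF_H, 0.0, 1.0))
        pts.append((t * HALF_W, -HALF_H, 0.0, -1.0))
        pts.append((HALF_W, t * HALF_H, 1.0, 0.0))
        pts.append((-HALF_W, t * HALF_H, -1.0, 0.0))
    return pts


def fitness(pose, pts, offset=0.2):
    """Reward high density just inside the surface and low density along the normal."""
    tx, ty, th = pose
    c, s = math.cos(th), math.sin(th)
    total = 0.0
    for px, py, nx, ny in pts:
        for sign in (-1.0, 1.0):  # -1: surface-side point, +1: point along the normal
            qx, qy = px + sign * offset * nx, py + sign * offset * ny
            wx, wy = c * qx - s * qy + tx, s * qx + c * qy + ty
            total -= sign * density(wx, wy)
    return total / len(pts)


def grad(pose, pts, h=1e-4):
    """Central finite-difference gradient of the fitness (stand-in for autodiff)."""
    g = []
    for i in range(3):
        hi, lo = list(pose), list(pose)
        hi[i] += h
        lo[i] -= h
        g.append((fitness(hi, pts) - fitness(lo, pts)) / (2.0 * h))
    return g


def refine(pose, pts, steps=200, lr=0.02, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.0):
    """Hand-written AdamW on the negative fitness; returns the best pose seen."""
    p, m, v = list(pose), [0.0] * 3, [0.0] * 3
    best_p, best_f = tuple(p), fitness(p, pts)
    for t in range(1, steps + 1):
        g = [-gi for gi in grad(p, pts)]
        for i in range(3):
            m[i] = b1 * m[i] + (1.0 - b1) * g[i]
            v[i] = b2 * v[i] + (1.0 - b2) * g[i] * g[i]
            m_hat, v_hat = m[i] / (1.0 - b1 ** t), v[i] / (1.0 - b2 ** t)
            p[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p[i])
        f = fitness(p, pts)
        if f > best_f:
            best_p, best_f = tuple(p), f
    return best_p, best_f


def fit(hypotheses, pts):
    """Refine every hypothesis independently and keep the fittest result."""
    return max((refine(h, pts) for h in hypotheses), key=lambda r: r[1])


PTS = model_points()
HYPOTHESES = [(0.0, 0.0, a) for a in (-0.6, 0.0, 0.6, 1.2)]  # assumed initial guesses
best_pose, best_fit = fit(HYPOTHESES, PTS)
print("best pose:", best_pose, "fitness:", best_fit)
```

Running several hypotheses matters because the objective is only informative near the object boundary: a poorly initialised hypothesis can stall in a flat or locally optimal region, while another hypothesis started closer to the true pose reaches a higher fitness and is selected.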

Abstract

Accurate 3D object pose estimation is key to enabling many robotic applications that involve challenging object interactions. In this work, we show that the density field created by a state-of-the-art efficient radiance field reconstruction method is suitable for highly accurate and robust pose estimation for objects with known 3D models, even when they are very small and have challenging reflective surfaces. We present a fully automatic object pose estimation system based on a robot arm with a single wrist-mounted camera, which can scan a scene from scratch, then detect and estimate the 6-Degrees-of-Freedom (DoF) poses of multiple objects within a couple of minutes of operation. Small objects such as bolts and nuts are estimated with accuracy on the order of 1 mm.

Paper Structure (15 sections, 2 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: A set of posed images (top-left) from a scene containing multiple objects is used to train a NeRF. The reconstructed density field (bottom-right) is employed to align model-to-scene poses for each object using multi-hypothesis optimisation, with the best pose silhouette reprojection overlaid (magenta).
  • Figure 2: Overview of the proposed framework: an Instant-NGP reconstruction is obtained from images captured by a robot's wrist-mounted camera. Objects of interest are segmented from a reference view, and a depth map rendered from the same view is used to initialise a set of per-object pose hypotheses. Each hypothesis is optimised to find the best pose alignment using Instant-NGP's density field.
  • Figure 3: Section of a reconstructed density field within an object. Note that, while the RGB renders from a NeRF can achieve high fidelity, the underlying density field can be as noisy as shown here, even after NeRF convergence. This noisy density field is used in our fitness function, Eq. (\ref{eq:alignment}), which encourages points near the surface of the aligned object model $\mathcal{X}^\mathcal{S}$ and points along their normals $\mathcal{X}^\mathcal{N}$ to fall within high-density and low-density regions, respectively.
  • Figure 4: Self-collected real-world datasets where the poses of all the relevant objects are accurately estimated, as evidenced by the re-projection of the silhouettes of their models perfectly aligning in the image plane.
  • Figure 5: Example of a failure case of the proposed system. Instant-NGP can produce photorealistic RGB renderings (top left) even when the quality of the underlying density field is poor (top right). Despite this, object poses can still be retrieved provided that the pose hypothesis initialisation is sufficiently close to the optimum (bottom left); otherwise the optimisation fails (bottom right).
  • ...and 3 more figures
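
The hypothesis initialisation described in the Figure 2 caption (a segmentation mask plus a rendered depth map from the same reference view) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, takes the centroid of the back-projected masked pixels as the shared translation guess, and spreads a fixed number of yaw angles uniformly to form the hypothesis set. The function names, the yaw-only rotation sampling, and the toy inputs are all hypothetical.

```python
import math


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to 3D camera-frame points with a pinhole model."""
    pts = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask[v][u] and z > 0.0:
                pts.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return pts


def centroid(pts):
    """Mean of a non-empty list of 3D points."""
    if not pts:
        raise ValueError("mask selects no valid depth pixels")
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))


def init_hypotheses(depth, mask, intrinsics, n_yaw=8):
    """One (translation, yaw) hypothesis per uniformly spaced yaw angle (assumed scheme)."""
    c = centroid(backproject(depth, mask, *intrinsics))
    return [(c, 2.0 * math.pi * k / n_yaw) for k in range(n_yaw)]


# Toy 4x4 depth map at 1 m with a 2x2 object mask in the middle.
depth = [[1.0] * 4 for _ in range(4)]
mask = [[1 <= u <= 2 and 1 <= v <= 2 for u in range(4)] for v in range(4)]
hyps = init_hypotheses(depth, mask, (2.0, 2.0, 1.5, 1.5), n_yaw=4)
for translation, yaw in hyps:
    print(translation, round(yaw, 3))
```

Each hypothesis would then be handed to the pose optimisation against the density field. As the Figure 5 caption notes, the result depends on at least one hypothesis starting sufficiently close to the optimum.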