PixTrack: Precise 6DoF Object Pose Tracking using NeRF Templates and Feature-metric Alignment

Prajwal Chidananda, Saurabh Nair, Douglas Lee, Adrian Kaehler

TL;DR

PixTrack tracks the 6DoF pose of an object in monocular RGB and RGB-D sequences by representing the object with an object-centered NeRF and running a PixLoc-style feature-metric optimization against novel-view renderings. For each frame, the method renders a reference view from the previous frame’s pose, computes multi-scale feature residuals (and depth residuals when depth is available), and optimizes an SE(3) update, with no auxiliary pose network and no annotated trajectories. Training images for the Object-NeRF are captured with a turntable protocol. An enhanced COLMAP-based SfM pipeline recovers accurate object geometry, and a clean Object-NeRF is then extracted through NeRF differencing. Experiments on YCB-Video show that depth information improves accuracy and that tracking runs online and jitter-free without annotation, with caching and occlusion-aware masking keeping the method efficient. The result is a practical, annotation-free, multi-object tracking framework that combines a canonical NeRF representation with feature-metric optimization for 6DoF pose tracking in real-world scenes.

Abstract

We present PixTrack, a vision-based object pose tracking framework using novel view synthesis and deep feature-metric alignment. We follow an SfM-based relocalization paradigm in which a Neural Radiance Field canonically represents the tracked object. Our evaluations demonstrate that our method produces highly accurate, robust, and jitter-free 6DoF pose estimates of objects in both monocular RGB images and RGB-D images without the need for any data annotation or trajectory smoothing. Our method is also computationally efficient, so multi-object tracking requires no alteration to the algorithm and is achieved through simple CPU multiprocessing. Our code is available at: https://github.com/GiantAI/pixtrack

Paper Structure (14 sections, 8 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: PixTrack uses a Neural Radiance Field as the canonical representation of a given object and provides pixel-level accuracy in 6-DoF object tracking for monocular RGB (images on the left) or RGB-D (images on the right) sequences.
  • Figure 2: Localization technique used in PixTrack. (1) Input frame feature extraction: given an input RGB frame, we extract deep features using a pre-trained CNN. (2) Reference frame feature extraction: using the pose predicted for the previous frame, we render a novel view from the canonical Object-NeRF, use this view as the reference frame, and extract deep features with the same pre-trained CNN. (3) Reference frame feature interpolation: we interpolate the reference frame features at the 2D points obtained by projecting the 3D SfM points onto the reference image. (4) Feature-metric alignment: we iteratively optimize over SE(3) with a Levenberg-Marquardt (LM) optimizer to obtain the pose of the object in the input frame.
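
The alignment step described in Figure 2 can be illustrated with a toy NumPy sketch. This is not the PixTrack implementation, and it rests on several assumptions: an analytic function `features` stands in for the CNN feature maps, the reference descriptors are generated directly at the true projections instead of being interpolated from a NeRF rendering, the Jacobian is computed by finite differences, and the pose update uses a simple retraction in place of the full SE(3) exponential. The names `retract`, `project`, `features`, and `lm_align` are hypothetical.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def retract(xi):
    """Map a 6-vector (translation, rotation vector) to a 4x4 rigid transform.
    Rotation uses Rodrigues' formula; translation is used as-is (assumption:
    a simple retraction, not the full SE(3) exponential map)."""
    rho, omega = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    K = hat(omega)
    if theta < 1e-12:
        R = np.eye(3) + K
    else:
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1.0 - np.cos(theta)) / theta**2 * (K @ K))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, rho
    return T

def project(T, pts):
    """Project Nx3 object points to normalized image coordinates under pose T."""
    cam = pts @ T[:3, :3].T + T[:3, 3]
    return cam[:, :2] / cam[:, 2:3]

def features(uv):
    """Toy smooth 4-channel 'feature map' sampled at continuous coordinates
    (stand-in for interpolated CNN features)."""
    x, y = uv[:, 0], uv[:, 1]
    return np.stack([np.sin(2 * x), np.sin(2 * y),
                     np.cos(1.5 * (x + y)), np.sin(1.5 * (x - y))], axis=1)

def residuals(T, pts, ref_feats):
    """Feature-metric residuals: query features at projected points minus
    the reference descriptors attached to the 3D points."""
    return (features(project(T, pts)) - ref_feats).ravel()

def numeric_jacobian(T, pts, ref_feats, eps=1e-6):
    """Central-difference Jacobian with respect to a left pose perturbation."""
    J = np.zeros((ref_feats.size, 6))
    for k in range(6):
        d = np.zeros(6)
        d[k] = eps
        r_plus = residuals(retract(d) @ T, pts, ref_feats)
        r_minus = residuals(retract(-d) @ T, pts, ref_feats)
        J[:, k] = (r_plus - r_minus) / (2 * eps)
    return J

def lm_align(T_init, pts, ref_feats, iters=30, lam=1e-3):
    """Levenberg-Marquardt loop: damped Gauss-Newton steps, accepted only
    when they lower the cost."""
    T = T_init.copy()
    r = residuals(T, pts, ref_feats)
    cost = 0.5 * r @ r
    for _ in range(iters):
        J = numeric_jacobian(T, pts, ref_feats)
        H, g = J.T @ J, J.T @ r
        delta = np.linalg.solve(H + lam * np.eye(6), -g)
        T_new = retract(delta) @ T
        r_new = residuals(T_new, pts, ref_feats)
        cost_new = 0.5 * r_new @ r_new
        if cost_new < cost:          # accept step, trust the model more
            T, r, cost = T_new, r_new, cost_new
            lam = max(lam / 10, 1e-9)
        else:                        # reject step, increase damping
            lam *= 10
    return T, cost

# Synthetic setup: 3D points (standing in for SfM points) and a true pose.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.3, 0.3, size=(50, 3))
T_true = retract(np.array([0.1, -0.05, 2.0, 0.2, -0.1, 0.3]))
ref_feats = features(project(T_true, pts))

# Warm start from a nearby pose, as the previous frame's estimate would provide.
T_prev = retract(np.array([0.05, -0.03, 0.08, 0.04, -0.05, 0.03])) @ T_true
T_est, final_cost = lm_align(T_prev, pts, ref_feats)
print("final cost:", final_cost)
```

The warm start matters in this sketch: the optimization is local, so it relies on the previous frame's pose being close enough to the current one for the feature residuals to lie within the basin of convergence.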