SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

Yuliang Guo, Abhinav Kumar, Cheng Zhao, Ruoyu Wang, Xinyu Huang, Liu Ren

TL;DR

SUP-NeRF addresses monocular 3D object reconstruction by unifying pose estimation with NeRF-based reconstruction in a detector-free, object-centric framework. It decouples object dimension estimation from pose refinement and introduces a camera-invariant projected-box representation, enabling robust pose updates with rotations in $SO(3)$ and translations in a relative space $\Delta T^{(t)}=[v^{(t)}_x,v^{(t)}_y,\rho^{(t)}]$. The approach uses an iterative, cross-task design that shares an image encoder across tasks while subtracting pose cues from shape/texture features, leading to improved generalization and state-of-the-art results on nuScenes, with strong cross-dataset performance on KITTI and Waymo. The experiments show the method's advantages in reconstruction quality and pose accuracy, and indicate potential for real-time use with faster neural rendering backends, a step toward practical monocular 3D object reconstruction without external detectors.

Abstract

Monocular 3D reconstruction for categorical objects heavily relies on accurately perceiving each object's pose. While gradient-based optimization in a NeRF framework updates the initial pose, this paper highlights that scale-depth ambiguity in monocular object reconstruction causes failures when the initial pose deviates moderately from the true pose. Consequently, existing methods often depend on a third-party 3D object detector to provide an initial object pose, leading to increased complexity and generalization issues. To address these challenges, we present SUP-NeRF, a Streamlined Unification of object Pose estimation and NeRF-based object reconstruction. SUP-NeRF decouples the object's dimension estimation and pose refinement to resolve the scale-depth ambiguity, and introduces a camera-invariant projected-box representation that generalizes across different domains. While using a dedicated pose estimator that smoothly integrates into an object-centric NeRF, SUP-NeRF is free from external 3D detectors. SUP-NeRF achieves state-of-the-art results in both reconstruction and pose estimation tasks on the nuScenes dataset. Furthermore, SUP-NeRF exhibits exceptional cross-dataset generalization on the KITTI and Waymo datasets, surpassing prior methods with up to 50% reduction in rotation and translation error.
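
The scale-depth ambiguity mentioned in the abstract can be illustrated with a plain pinhole camera. The sketch below is not taken from the paper: the intrinsics, the box dimensions, and the helper names (`project`, `box_corners`, `scaled_scene`) are illustrative assumptions. It only shows that a box enlarged by a factor `s` and pushed `s` times farther away projects to the same pixels, so a single image cannot tell the two apart unless dimensions are fixed separately.

```python
def project(point, focal=1000.0, u0=800.0, v0=450.0):
    """Pinhole projection of a camera-frame point (x, y, z) to pixels (u, v).

    The focal length and principal point are made-up values for illustration.
    """
    x, y, z = point
    return (focal * x / z + u0, focal * y / z + v0)


def box_corners(center, dims):
    """Eight corners of an axis-aligned box given its center and full extents."""
    cx, cy, cz = center
    hx, hy, hz = (d / 2.0 for d in dims)
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]


def scaled_scene(center, dims, s):
    """Scale both the object size and its position (hence depth) by s."""
    return tuple(s * c for c in center), tuple(s * d for d in dims)


# A car-sized box 20 m in front of the camera (hypothetical numbers).
center, dims = (2.0, 1.0, 20.0), (4.5, 1.6, 1.9)
near_pixels = [project(p) for p in box_corners(center, dims)]

# The same box, 1.5x larger and 1.5x farther away.
far_center, far_dims = scaled_scene(center, dims, 1.5)
far_pixels = [project(p) for p in box_corners(far_center, far_dims)]

# Largest pixel difference between the two projections (zero up to rounding).
max_diff = max(
    abs(a - b)
    for pa, pb in zip(near_pixels, far_pixels)
    for a, b in zip(pa, pb)
)
print(max_diff)
```

Because the two configurations are indistinguishable in the image, an optimizer that is free to change both size and depth has no image-based signal to choose between them, which is consistent with the abstract's motivation for estimating dimensions separately from pose refinement.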

Paper Structure (31 sections, 9 equations, 21 figures, 13 tables)

This paper contains 31 sections, 9 equations, 21 figures, 13 tables.

Figures (21)

  • Figure 1: Teaser. SUP-NeRF is a unified solution that predicts an object's pose, shape, and texture using a single network. SUP-NeRF is trained on real driving scenes with imprecise labels, and it adapts robustly to new cross-dataset scenarios.
  • Figure 2: Scale-depth ambiguity in NeRF. Given the input image (left), joint optimization of pose, shape, and texture in NeRF has full freedom to rescale the shape within the normalized shape space (blue box) or move the 3D box. This phenomenon is visible in the evolution of the rendered objects from iteration 0 (middle) to iteration 50 (right).
  • Figure 3: SUP-NeRF Overview. SUP-NeRF unifies pose estimation and NeRF. The pose estimation module enables SUP-NeRF to work for objects in diverse poses without external 3D detectors. The added complexity amounts to only a few MLP layers.
  • Figure 4: SUP-NeRF Pose Estimation Module. The pose estimation module of SUP-NeRF iteratively updates the object's pose while preserving scale. It takes the projection of the 3D box corners as a visual representation of the input pose and estimates the pose update by comparing it to the observed image in a latent embedding space. These designs handle scale-depth ambiguity and make the deep refiner independent of camera intrinsic parameters for better cross-domain generalization.
  • Figure 5: nuScenes Cross-view Evaluation. Each row shows a set of images of the same object from different angles. We use the example for both monocular pose estimation and NeRF reconstruction. Other images in each row are used to evaluate the reconstructed shape and texture in PSNR and Depth Error (DE).
  • ...and 16 more figures
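
The Figure 4 caption describes an iterative pose update that preserves scale, and the TL;DR gives the relative translation as $\Delta T^{(t)}=[v^{(t)}_x,v^{(t)}_y,\rho^{(t)}]$. The exact update rule is not stated in this summary, so the sketch below assumes one plausible reading: $(v_x, v_y)$ shift the object center in normalized image coordinates $(x/z, y/z)$ and $\rho$ rescales depth multiplicatively. The function names and the toy predictor (which stands in for the learned refiner) are hypothetical.

```python
def apply_relative_update(center, delta):
    """Apply a relative translation update (v_x, v_y, rho) to a camera-frame center.

    Assumed semantics (not confirmed by the source): rho multiplies depth, and
    (v_x, v_y) are added to the normalized image coordinates (x/z, y/z), so no
    focal length or principal point is needed.
    """
    x, y, z = center
    v_x, v_y, rho = delta
    z_new = z * rho
    x_new = (x / z + v_x) * z_new
    y_new = (y / z + v_y) * z_new
    return (x_new, y_new, z_new)


def make_toy_predictor(target, gain=0.5):
    """Hypothetical stand-in for a learned refiner: closes part of the gap to a known target."""
    tx, ty, tz = target

    def predict(center, dims):
        x, y, z = center
        return (
            gain * (tx / tz - x / z),  # partial shift in normalized image x
            gain * (ty / tz - y / z),  # partial shift in normalized image y
            (tz / z) ** gain,          # partial multiplicative depth correction
        )

    return predict


def refine(center, dims, predict_delta, steps=10):
    """Iteratively update the translation; the object dimensions are never touched."""
    for _ in range(steps):
        center = apply_relative_update(center, predict_delta(center, dims))
    return center, dims


target = (2.0, 1.0, 20.0)           # hypothetical true object center in meters
initial = (1.0, 0.5, 30.0)          # deliberately poor initial guess
initial_dims = (4.5, 1.6, 1.9)      # dimensions assumed to come from a separate estimate

refined_center, refined_dims = refine(initial, initial_dims, make_toy_predictor(target))
print(refined_center, refined_dims)
```

In this reading the dimensions pass through the loop unchanged, so the only way to explain the image is to move the object, which is the sense in which a scale-preserving update sidesteps the ambiguity shown in Figure 2.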