Generic Objects as Pose Probes for Few-shot View Synthesis

Zhirui Gao, Renjiao Yi, Chenyang Zhu, Ke Zhuang, Wei Chen, Kai Xu

TL;DR

The paper addresses the challenge of reconstructing NeRFs from extremely few unposed views by introducing PoseProbe, which leverages everyday objects as pose probes segmented by Grounded-SAM. It presents a dual-branch architecture combining an object NeRF with a hybrid SDF representation and a scene NeRF, with incremental PnP-based pose initialization and joint optimization guided by geometric constraints, multi-view consistency, and feature-based alignment. The approach demonstrates state-of-the-art pose estimation and novel-view synthesis across ShapeScene, ToyDesk, DTU, and Replica, particularly excelling in sparse-view and large-baseline scenarios where COLMAP struggles, and shows robustness to probe choice. These results suggest practical applicability for real-world scenes lacking pose priors, enabling accurate rendering and geometry recovery with minimal input views.

Abstract

Radiance fields, including NeRFs and 3D Gaussians, demonstrate great potential in high-fidelity rendering and scene reconstruction, but they require a substantial number of posed images as input. COLMAP is frequently employed as a preprocessing step to estimate poses, but it needs a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards, but these are rarely present in images. We propose utilizing everyday objects, commonly found in both images and real life, as "pose probes". The probe object is automatically segmented by SAM, and its shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation and serve as initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance. Our project page is available at: https://zhirui-gao.github.io/PoseProbe.github.io/

Paper Structure

This paper contains 34 sections, 20 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Our method addresses pose estimation and NeRF-based reconstruction in the challenging few-view setting (only 3 unposed views). Most NeRF-based approaches are initialized from COLMAP poses. In sparse regimes, however, COLMAP often fails to initialize, which makes it difficult for pose optimization in the state-of-the-art method SPARF (Truong et al., 2023) to work well. The state-of-the-art COLMAP-free pipeline CF-3DGS (Fu et al., 2024) also struggles in sparse-view scenarios. We propose spotting generic objects in the scene as "pose probes" (a face-shaped toy in this example); this achieves realistic novel-view renderings and accurately reconstructs geometry using only 3 input images.
  • Figure 2: Method overview. Our approach utilizes generic objects as pose probes for few-view inputs, with their masks automatically segmented using Grounded-SAM (Ren et al., 2024) with text prompts. The pose probe is initialized as a cuboid and then used to estimate the initial poses of input images incrementally via PnP. Our pipeline employs a dual-branch volume rendering framework to jointly optimize camera poses and the scene representation. In the object NeRF branch, a hybrid SDF representation models the object geometry, enforcing constraints such as deformation regularization, multi-view consistency, and rendering loss. The scene branch optimizes the entire scene within an implicit radiance field. We refine camera poses simultaneously, incorporating constraints like rendering loss, multi-view consistency, and distribution regularization, yielding precise pose estimation.
  • Figure 3: Overview of the hybrid SDF representation. For a given point $\bm{p}$, the implicit deformation field predicts a deformation vector $\bm{v}$ and a scalar correction $\Delta s$. The point position is first deformed to $\bm{p}^{\prime} = \bm{p} + \bm{v}$, and $s^{\prime}$ is queried from the template field at $\bm{p}^{\prime}$. The final SDF value $s$ is computed by applying a non-linear scale mapping function $S$ to the sum of $s^{\prime}$ and $\Delta s$, i.e. $s = S(s^{\prime} + \Delta s)$.
  • Figure 4: The adaptive tuning of the training parameters $\beta$ and $\gamma$ in the SDF mapping function is demonstrated across different scenes in the DTU dataset. The parameters vary from scene to scene, with each scene represented by a distinct colored curve.
  • Figure 5: Illustration of multi-view geometric consistency and ray distance loss. Multi-view geometric consistency ensures accurate alignment of corresponding points across multiple views by minimizing the reprojection error, while the ray distance loss regularizes the minimal distance between camera rays and the surface of the pose probe. Together, these contribute to improved scene reconstruction and camera pose estimation.
  • ...and 11 more figures