Generic Objects as Pose Probes for Few-shot View Synthesis
Zhirui Gao, Renjiao Yi, Chenyang Zhu, Ke Zhuang, Wei Chen, Kai Xu
TL;DR
The paper addresses the challenge of reconstructing NeRFs from extremely few unposed views by introducing PoseProbe, which leverages everyday objects as pose probes segmented by Grounded-SAM. It presents a dual-branch architecture combining an object NeRF with a hybrid SDF representation and a scene NeRF, with incremental PnP-based pose initialization and joint optimization guided by geometric constraints, multi-view consistency, and feature-based alignment. The approach demonstrates state-of-the-art pose estimation and novel-view synthesis across ShapeScene, ToyDesk, DTU, and Replica, particularly excelling in sparse-view and large-baseline scenarios where COLMAP struggles, and shows robustness to probe choice. These results suggest practical applicability for real-world scenes lacking pose priors, enabling accurate rendering and geometry recovery with minimal input views.
Abstract
Radiance fields, including NeRFs and 3D Gaussians, demonstrate great potential in high-fidelity rendering and scene reconstruction, but they require a substantial number of posed images as input. COLMAP is frequently employed as a preprocessing step to estimate poses, but it needs a large number of feature matches to operate effectively and struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards, but these are rarely present in images. We propose a novel idea of utilizing everyday objects, commonly found in both images and real life, as "pose probes". The probe object is automatically segmented by SAM, and its shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, the object poses of two views are first estimated by PnP matching in an SDF representation and serve as the initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance. Our project page is available at: https://zhirui-gao.github.io/PoseProbe.github.io/
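To make the cube initialization mentioned above concrete, the following is a minimal sketch of the analytic signed distance function (SDF) of an axis-aligned cube. It is only an illustration of what a cube-shaped SDF looks like, not the paper's implementation: the function name `cube_sdf`, the `half_extent` parameter, and the choice of a cube centred at the origin are assumptions made for this example.

```python
import math


def cube_sdf(point, half_extent=0.5):
    """Signed distance from `point` to an axis-aligned cube centred at the origin.

    Hypothetical helper for illustration; negative inside, zero on the
    surface, positive outside.
    """
    # Per-axis offset from the cube faces (negative when inside along that axis).
    q = [abs(c) - half_extent for c in point]
    # Euclidean distance to the surface for points outside the cube.
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    # Negative depth to the nearest face for points inside the cube, else zero.
    inside = min(max(q), 0.0)
    return outside + inside


if __name__ == "__main__":
    print(cube_sdf((0.0, 0.0, 0.0)))  # centre of the cube: -0.5
    print(cube_sdf((1.0, 0.0, 0.0)))  # half a unit outside one face: 0.5
```

A field of this kind gives every 3D point a well-defined distance to a simple starting surface, which an optimization could then deform toward the shape of the actual probe object.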
