Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion
Xin Yuan, Rana Hanocka, Michael Maire
TL;DR
This work tackles multiview 3D reconstruction from images with unknown camera poses by casting it as a generative denoising problem. A pose-predictor encoder and a Neural Radiance Field are wrapped inside a denoising diffusion probabilistic model and trained end-to-end with a standard denoising objective, encouraging the system to infer both view correspondences and a coherent 3D representation. A key contribution is a pose distribution mechanism and multi-pose rendering that enables robust 360-degree scene reconstruction where prior methods fail, along with capabilities for novel view generation. The approach yields unsupervised NeRF reconstructions, plausible novel views, and a pathway toward scalable, annotation-free 3D understanding of complex scenes.
Abstract
We cast multiview reconstruction from unknown pose as a generative modeling problem. From a collection of unannotated 2D images of a scene, our approach simultaneously learns both a network to predict camera pose from 2D image input, as well as the parameters of a Neural Radiance Field (NeRF) for the 3D scene. To drive learning, we wrap both the pose prediction network and NeRF inside a Denoising Diffusion Probabilistic Model (DDPM) and train the system via the standard denoising objective. Our framework requires the system accomplish the task of denoising an input 2D image by predicting its pose and rendering the NeRF from that pose. Learning to denoise thus forces the system to concurrently learn the underlying 3D NeRF representation and a mapping from images to camera extrinsic parameters. To facilitate the latter, we design a custom network architecture to represent pose as a distribution, granting implicit capacity for discovering view correspondences when trained end-to-end for denoising alone. This technique allows our system to successfully build NeRFs, without pose knowledge, for challenging scenes where competing methods fail. At the conclusion of training, our learned NeRF can be extracted and used as a 3D scene model; our full system can be used to sample novel camera poses and generate novel-view images.
