Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion

Xin Yuan, Rana Hanocka, Michael Maire

TL;DR

This work tackles multiview 3D reconstruction from images with unknown camera poses by casting it as a generative denoising problem. A pose-prediction encoder and a Neural Radiance Field (NeRF) are wrapped inside a Denoising Diffusion Probabilistic Model (DDPM) and trained end-to-end with the standard denoising objective, which forces the system to infer both view correspondences and a coherent 3D representation. The key contributions are a pose-distribution representation and multi-pose rendering, which together enable 360-degree scene reconstruction on scenes where competing methods fail. After training, the learned NeRF can be extracted as a 3D scene model, and the full system can sample novel camera poses and generate novel-view images, all without pose annotations.

Abstract

We cast multiview reconstruction from unknown pose as a generative modeling problem. From a collection of unannotated 2D images of a scene, our approach simultaneously learns both a network to predict camera pose from 2D image input and the parameters of a Neural Radiance Field (NeRF) for the 3D scene. To drive learning, we wrap both the pose prediction network and NeRF inside a Denoising Diffusion Probabilistic Model (DDPM) and train the system via the standard denoising objective. Our framework requires the system to accomplish the task of denoising an input 2D image by predicting its pose and rendering the NeRF from that pose. Learning to denoise thus forces the system to concurrently learn the underlying 3D NeRF representation and a mapping from images to camera extrinsic parameters. To facilitate the latter, we design a custom network architecture to represent pose as a distribution, granting implicit capacity for discovering view correspondences when trained end-to-end for denoising alone. This technique allows our system to successfully build NeRFs, without pose knowledge, for challenging scenes where competing methods fail. At the conclusion of training, our learned NeRF can be extracted and used as a 3D scene model; our full system can be used to sample novel camera poses and generate novel-view images.
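
The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear noise schedule, the single-angle pose, the `toy_renderer` standing in for a NeRF, and the names `make_alpha_bar`, `forward_diffuse`, and `denoising_loss` are all assumptions made for illustration. The only ideas taken from the text are that a noisy image is mapped to a pose, the scene is rendered from that pose, and the render is scored against the clean image with an MSE loss.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0): scaled clean image plus Gaussian noise."""
    a = alpha_bar[t]
    return [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for p in x0]


def denoising_loss(x0, x_t, t, encoder, renderer):
    """MSE between the clean image and the render from the predicted pose."""
    pose = encoder(x_t, t)      # noisy image -> camera pose
    x0_hat = renderer(pose)     # render acts as the predicted denoised image
    return sum((a - b) ** 2 for a, b in zip(x0_hat, x0)) / len(x0)


def toy_renderer(angle, width=8):
    """Hypothetical stand-in for a NeRF: a 1D 'image' that depends on one angle."""
    return [math.sin(angle + 0.3 * i) for i in range(width)]


rng = random.Random(0)
alpha_bar = make_alpha_bar(1000)
true_angle = 1.2
x0 = toy_renderer(true_angle)                 # clean training view
x_t = forward_diffuse(x0, 10, alpha_bar, rng)  # noisy input to the encoder

# Two hypothetical encoders: one that recovers the true pose, one that is off.
oracle_loss = denoising_loss(x0, x_t, 10, lambda x, t: true_angle, toy_renderer)
wrong_loss = denoising_loss(x0, x_t, 10, lambda x, t: true_angle + 1.0, toy_renderer)
```

In this toy setting the loss is zero only when the encoder returns the pose that actually produced the view, which is the sense in which learning to denoise pressures the system to recover pose. In the real system both the encoder and the scene representation are learned jointly by gradient descent, which this sketch does not attempt.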

Paper Structure

This paper contains 15 sections, 4 equations, 14 figures, 2 tables, 2 algorithms.

Figures (14)

  • Figure 1: Wrapping NeRF inside Diffusion. We learn a 3D scene reconstruction by training a denoising diffusion model (DDPM) on a dataset of 2D views of the scene. The architecture of our DDPM consists of two components. Left: An Encoder predicts the pose of a single noisy 2D input image. Right: A NeRF is rendered from the predicted camera pose to create a 2D output image that is treated as the predicted denoising of the input view. The system must learn parameters of both the Encoder and NeRF so that any 2D view can be denoised by predicting a camera and rendering the scene. The NeRF rendering process is differentiable with respect to rays shot from the camera, which themselves depend on the camera-to-world transformation matrix produced by the encoder. All modules are end-to-end trainable, and the system is optimized by the simple MSE loss on denoising.
  • Figure 2: Unifying pose prediction, 3D reconstruction, and novel-view image generation. Our trained system (Figure 1) can be deployed for multiple tasks. Pose prediction (top): We can predict the pose of a previously unseen real image by adding a small amount of noise (forward diffusion) and feeding it to our Encoder (Figure 1, left). Rendering our learned NeRF from that camera pose should reconstruct the real image. Direct NeRF usage (middle): Our learned NeRF can be extracted and directly used to render the scene (e.g., along a manually specified camera path). Sampling cameras and views (bottom): Performing sequential diffusion denoising from pure Gaussian noise input synthesizes a camera pose from which rendering the NeRF generates a novel view of the scene.
  • Figure 3: Pose distribution representation and multi-pose rendering for $360^\circ$ scenes. In order to perform view denoising by learning a NeRF and predicting the pose from which to render it, our system (Figure 1) must implicitly solve multiview correspondence by mapping training images (of unknown pose) into consistent locations in the 3D environment. We enable training via gradient descent to discover such solutions for challenging multiview datasets (e.g., spanning $360^\circ$) by augmenting our architecture with the capacity to represent uncertainty over a pose distribution. Left: Our encoder, given a noisy image $\bm{x}_t$, predicts parameters for multiple cameras and a corresponding probability distribution over cameras, $\bm{s}_{pose}$. Right: During training, we render the NeRF from each predicted camera and use the best reconstruction to calculate the denoising loss; an auxiliary classification loss pushes the predicted camera distribution to upweight the selected output. At test time, we render using only the single camera predicted as most likely by the classifier.
  • Figure 4: Reconstructions of images unseen during training on three scenes from LLFF (Mildenhall et al., 2019).
  • Figure 5: Reconstructions on $360^\circ$ scenes.
  • ...and 9 more figures
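
The multi-pose training rule in the Figure 3 caption (render from every predicted camera, keep the best reconstruction for the denoising loss, and push the camera distribution toward that choice) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: images are flat lists of floats, the auxiliary classification loss is taken to be cross-entropy against the index of the best render, and `multi_pose_loss`, `select_test_pose`, and `aux_weight` are hypothetical names.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_pose_loss(target, renders, pose_logits, aux_weight=1.0):
    """Best-of-K denoising loss plus an auxiliary classification term.

    renders[k] is the image rendered from the k-th predicted camera and
    pose_logits[k] is the encoder's unnormalized score for that camera.
    """
    errors = [mse(r, target) for r in renders]
    best = min(range(len(errors)), key=errors.__getitem__)
    # Cross-entropy (assumed form) that upweights the selected camera.
    aux = -math.log(softmax(pose_logits)[best])
    return errors[best] + aux_weight * aux, best


def select_test_pose(pose_logits):
    """At test time, use only the camera the classifier scores highest."""
    return max(range(len(pose_logits)), key=pose_logits.__getitem__)


# Toy example with three candidate cameras.
target = [0.0, 1.0, 0.0]
renders = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
pose_logits = [2.0, 0.0, -1.0]

total, best = multi_pose_loss(target, renders, pose_logits)
test_choice = select_test_pose(pose_logits)
```

Here the second camera gives the best reconstruction while the classifier currently prefers the first, so the auxiliary term is large. Minimizing it would move probability mass toward the second camera, which is what makes the single most-likely camera a sensible choice at test time.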