Denoising Diffusion via Image-Based Rendering

Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson

TL;DR

This work presents Generative Image-Based Rendering (GIBR), a diffusion-based framework that learns a prior over large-scale, unbounded 3D scenes from multi-view 2D images without explicit 3D supervision. It introduces IB-planes, a flexible, image-based 3D representation that fuses per-view features into a coherent latent; a 3D-consistent denoising mechanism ensures multi-view outputs depict a single, consistent scene. A dropout-based strategy prevents trivial solutions and enables robust 3D reconstruction and unconditional generation across varying conditioning views. Empirical results on MVImgNet, CO3D, and ShapeNet demonstrate superior performance for reconstruction, novel-view synthesis, and unconditional generation compared to strong baselines, highlighting the method’s potential for scalable 3D content creation from in-the-wild imagery.

Abstract

Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.
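
The third contribution, dropping out the representations of some images, can be pictured with a small sketch. This is only one possible reading of the abstract: it assumes that, when a given view is rendered, the features computed from that same view are excluded, so the model cannot simply copy each image back to its own camera. The function name `contributing_views` and this exact masking rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contributing_views(num_views, target_view):
    """Indices of views whose features may be used when rendering `target_view`.

    Assumption (not confirmed by the abstract): the target view's own
    features are the ones dropped, so it must be explained by other views.
    """
    keep = np.ones(num_views, dtype=bool)
    keep[target_view] = False  # drop the representation of the view being rendered
    return np.flatnonzero(keep)

# Usage: with four views, rendering view 2 would draw on views 0, 1 and 3.
views_for_target = contributing_views(num_views=4, target_view=2)
```

Under this reading, every rendered image has to be reconstructed from features unprojected from other viewpoints, which only works if the underlying 3D content is consistent.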

Paper Structure (47 sections, 3 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Our neural scene representation IB-planes defines 3D content using image-space features. Each camera $\pi_v$ is associated with a feature-map $\mathbf{f}_v$ (blue); together they parametrise a neural field that defines density and color for each 3D point $p$ (red dot). We incorporate this representation in a diffusion model over multi-view images. At each denoising step, noisy images ${\bf x}^{(t)}$ are encoded by a U-Net $E$ with cross-view attention (gray dashed arrows), which yields pixel-aligned features $\mathbf{f}_v$ (blue). To render pixels of denoised images (only one ${\bf x}^{(0)}$ is shown for clarity), we use volumetric ray-marching (green arrow), decoding features unprojected (red lines) from the other viewpoints.
  • Figure 2: Samples generated by our method trained on MVImgNet (first three rows) and CO3D (last three rows). Note that each multi-view image depicts a single coherent scene, with plausible appearance and detailed geometry. Please see the supplementary material for $1024\times1024$ video visualisations.
  • Figure 3: Results from our model on 3D reconstruction from a single image on MVImgNet (first 3 rows), CO3D (next 3 rows) and ShapeNet (last row). The leftmost column is the input; the next four show the ground-truth novel view images. The remaining columns show our model's prediction from those viewpoints and the predicted depth-maps. Please see the supplementary videos for more results.
  • Figure 4: Results from our model on 3D reconstruction from six images. The leftmost columns are the input (conditioning) views; the next two columns show ground-truth images at novel viewpoints. The remaining columns show our model's 3D reconstruction rendered from those viewpoints, as well as the predicted depth-maps. Note how the model faithfully reconstructs the geometric and textural details visible in its input images.
  • Figure 5: Single-view 3D reconstruction. The first row shows the conditioning (input) image, and the second row shows the ground-truth for novel views. The subsequent two rows show 3D scenes sampled by our model, showing predicted views and depth maps. Corresponding results from baseline models and the 2D multi-view diffusion ablation study are also shown. Note that the 2D multi-view diffusion ablation does not generate depth maps; in this case, two rows show multi-view image samples generated using different random seeds. Our model demonstrates high-fidelity reconstruction of 3D scenes with plausible reconstructions of unseen regions. In comparison, RenderDiffusion++ samples 3D scenes of low fidelity, while PixelNeRF++ fails to render plausible details in unobserved areas. Viewset Diffusion performs well on MVImgNet, but for the larger outdoor scenes in CO3D it often renders floaters or foggy surfaces. We also see that 2D multi-view diffusion (ablation of our model) produces images that are realistic in isolation; however, they are 3D inconsistent and often do not match the ground-truth pose of the object.
  • ...and 2 more figures
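
The Figure 1 caption describes how a 3D point $p$ obtains its features from the per-camera feature maps $\mathbf{f}_v$. A minimal NumPy sketch of that lookup follows. It assumes a pinhole camera given by intrinsics `K` and a world-to-camera pose `(R, t)`, bilinear interpolation of each feature map, and plain averaging across the views that see the point. These choices, and all function names, are illustrative assumptions; the caption does not specify how the method fuses per-view features or decodes them to density and color, and that decoding step is omitted here.

```python
import numpy as np

def project(point_world, K, R, t):
    """Pinhole projection of a world-space point.

    Returns ((u, v), depth), or None if the point lies behind the camera.
    """
    p_cam = R @ point_world + t
    depth = p_cam[2]
    if depth <= 0:
        return None
    uv = (K @ p_cam)[:2] / depth
    return uv, depth

def bilinear_sample(feature_map, uv):
    """Bilinearly interpolate an (H, W, C) feature map at pixel location (u, v)."""
    H, W, _ = feature_map.shape
    u = float(np.clip(uv[0], 0, W - 1))
    v = float(np.clip(uv[1], 0, H - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feature_map[v0, u0] + du * feature_map[v0, u1]
    bottom = (1 - du) * feature_map[v1, u0] + du * feature_map[v1, u1]
    return (1 - dv) * top + dv * bottom

def point_feature(point_world, cameras, feature_maps):
    """Fuse per-view features for one 3D point.

    Averaging is an assumed pooling rule used only for illustration.
    """
    samples = []
    for (K, R, t), fmap in zip(cameras, feature_maps):
        projected = project(point_world, K, R, t)
        if projected is None:
            continue  # point is behind this camera
        uv, _ = projected
        H, W, _ = fmap.shape
        if 0 <= uv[0] <= W - 1 and 0 <= uv[1] <= H - 1:
            samples.append(bilinear_sample(fmap, uv))
    if not samples:
        return np.zeros(feature_maps[0].shape[-1])  # no view sees the point
    return np.mean(samples, axis=0)

# Usage with two toy cameras sharing the same pose and 4x4 single-channel feature maps.
K = np.array([[2.0, 0.0, 1.5], [0.0, 2.0, 1.5], [0.0, 0.0, 1.0]])
pose = (K, np.eye(3), np.zeros(3))
fmap_a = np.tile(np.arange(4.0)[None, :, None], (4, 1, 1))  # value equals column index u
fmap_b = np.full((4, 4, 1), 3.0)
feat = point_feature(np.array([0.0, 0.0, 1.0]), [pose, pose], [fmap_a, fmap_b])
```

In a full pipeline, a fused feature like `feat` would be evaluated at many points along each camera ray and decoded to density and color before volumetric ray-marching composites the pixel.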