RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, Paul Guerrero

TL;DR

RenderDiffusion introduces the first diffusion model capable of 3D content generation and inference trained solely on monocular 2D data. It achieves this by embedding a latent 3D triplane representation inside the denoiser and rendering it with a volumetric, NeRF-like pipeline, enforcing 3D consistency from a single view. The paper demonstrates a score-distillation regularizer together with monocular 3D reconstruction, 3D-aware inpainting, and unconditional generation on ShapeNet, FFHQ, AFHQ, and CLEVR variants, with results competitive with specialized 3D methods. The approach enables view-consistent 3D generation directly from 2D supervision and supports 2D editing workflows for 3D scenes, making it a practical bridge between 2D diffusion models and 3D content creation.

Abstract

Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision. Central to our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, providing a 3D consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any view. We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes.

Paper Structure

This paper contains 36 sections, 6 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: We propose a 3D-aware image diffusion model that can be used for monocular 3D reconstruction, 3D-aware inpainting, and unconditional generation, while being trained with only monocular 2D supervision. Here we show results on ShapeNet and FFHQ.
  • Figure 2: Architecture overview. Images are generated by iteratively applying the denoiser $g_\theta$ to noisy input images, progressively removing the noise. Unlike traditional 2D diffusion models, our denoiser contains 3D structure in the form of a triplane representation $\mathbf{P}$ that is inferred from a noisy input image by the encoder $e_\phi$. A small MLP $s_\psi$ converts triplane features at arbitrary sample points into colors and densities that can then be rendered back into a denoised output image using a volumetric renderer.
  • Figure 3: RenderDiffusion results on FFHQ and AFHQ. We show reconstruction (top four rows), unconditional generation (bottom left), and 3D-aware inpainting (bottom right).
  • Figure 4: Reconstruction quality. We compare our results on ShapeNet car and plane to PixelNeRF and EG3D (through inversion). Compared to EG3D, our reconstructions better preserve shape identity; compared to PixelNeRF, ours are sharper and more detailed.
  • Figure 5: Unconditional generation. Results from RenderDiffusion and EG3D on ShapeNet categories; for RenderDiffusion, we show the view used during the reverse process in the first row. Note how our scenes have competitive quality and diversity compared to EG3D.
  • ...and 7 more figures
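
The Figure 2 caption describes a pipeline in which triplane features are sampled at 3D points, a small MLP maps them to colors and densities, and a volumetric renderer composites them into a pixel. The following is a minimal NumPy sketch of that idea for a single ray, not the authors' implementation. The function names (`sample_triplane`, `tiny_mlp`, `composite`), the nearest-neighbour lookup, the summation of features across planes, the random weights, and all sizes are assumptions made for illustration; the paper's encoder $e_\phi$ is not modelled here, so the planes are simply random.

```python
import numpy as np

def sample_triplane(planes, points):
    """Look up features for 3D points on three axis-aligned planes.

    planes: (3, C, R, R) array holding the XY, XZ and YZ feature planes.
    points: (N, 3) array with coordinates in [-1, 1].
    Assumption: nearest-neighbour lookup, features summed over planes.
    """
    res = planes.shape[-1]
    # Map coordinates from [-1, 1] to integer grid indices in [0, res - 1].
    idx = np.rint((points + 1.0) * 0.5 * (res - 1)).astype(int)
    idx = np.clip(idx, 0, res - 1)
    x, y, z = idx[:, 0], idx[:, 1], idx[:, 2]
    feats = planes[0][:, x, y] + planes[1][:, x, z] + planes[2][:, y, z]
    return feats.T  # (N, C)

def tiny_mlp(feats, w1, w2):
    """Two-layer MLP (hypothetical weights) mapping features to RGB and density."""
    hidden = np.maximum(feats @ w1, 0.0)       # ReLU
    out = hidden @ w2                          # (N, 4)
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))    # sigmoid keeps colors in (0, 1)
    sigma = np.logaddexp(0.0, out[:, 3])       # softplus keeps density >= 0
    return rgb, sigma

def composite(rgb, sigma, deltas):
    """Standard volume-rendering quadrature along one ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 8, 16, 16))       # stand-in for the triplane P
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 4))

# One ray travelling along +z through the unit cube, 32 evenly spaced samples.
t = np.linspace(0.0, 2.0, 32)
origin = np.array([0.0, 0.0, -1.0])
direction = np.array([0.0, 0.0, 1.0])
points = origin + t[:, None] * direction

rgb, sigma = tiny_mlp(sample_triplane(planes, points), w1, w2)
deltas = np.full(t.shape, t[1] - t[0])
pixel, weights = composite(rgb, sigma, deltas)
print(pixel)
```

In a full denoiser this rendering step would run for every pixel of the output image, with the planes predicted from the noisy input rather than drawn at random.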