SIGNeRF: Scene Integrated Generation for Neural Radiance Fields

Jan-Niklas Dihlmann; Andreas Engelhardt; Hendrik Lensch

TL;DR

This work finds that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views, and proposes SIGNeRF, a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation.

Abstract

Advances in image diffusion models have recently led to notable improvements in the generation of high-quality images. In combination with Neural Radiance Fields (NeRFs), they enabled new opportunities in 3D generation. However, most generative 3D approaches are object-centric and applying them to editing existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation. A new generative update strategy ensures 3D consistency across the edited images, without requiring iterative optimization. We find that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views. Based on these insights, we introduce a multi-view reference sheet of modified images. Our method updates an image collection consistently based on the reference sheet and refines the original NeRF with the newly generated image set in one go. By exploiting the depth conditioning mechanism of the image diffusion model, we gain fine control over the spatial location of the edit and enforce shape guidance by a selected region or an external mesh.

Paper Structure (36 sections, 3 equations, 10 figures, 2 tables)

Introduction
Related Work
Text-to-Image Generation
Text-to-3D Generation
NeRF Editing
Generative NeRF Editing
Method
Background
NeRF
ControlNet
Controlled Consistent Generation
Reference Sheet Generation
Image Set Update
Scene Integrated Generation
Selection Modes
...and 21 more sections

Figures (10)

Figure 1: SIGNeRF pipeline for NeRF scene editing -- here, object generation. First, the original NeRF scene is trained (1), and a proxy object is placed into the scene (2). After a precise selection, we place reference cameras ($5$ here) into the scene (3), render the corresponding color, depth, and mask images, and arrange them into image grids (4). These grids are used to generate the reference sheet with conditioned image diffusion (5). To propagate the edits to the entire image set, a color, depth, and mask image are rendered for each camera and placed into the empty slot of the fixed reference sheet. We generate a new edited image consistent with the reference sheet by leveraging an inpainting mask. This step is repeated for all cameras (6). Finally, the NeRF is fine-tuned on the edited images (7).
Figure 2: Object insertion and object modification -- (top) The cow geometry is centrally placed on a meadow to obtain a photorealistic scene. (middle) Note how occlusions are properly handled when generating the synthetic house based on the inserted proxy. (bottom) Objects can easily be transformed based on a prompt. Due to the more complex surface texture and geometric changes, the pirate and Batman costumes required an additional iteration to reach the same level of consistency as the simpler sports clothes.
Figure 3: Reference Sheet Generation -- Using ControlNet inpainting to edit scene parts image-by-image results in drastically different looks per view (left), although all parameters and the seed are the same. In contrast, we obtain a consistent reference sheet (right) by arranging the input images into a grid and letting ControlNet process the entire sheet in a single generation step.
Figure 4: Qualitative Comparison -- SIGNeRF results are compared to Instruct-NeRF2NeRF (top) and DreamEditor (bottom). For the bear, the fur texture generated with SIGNeRF (left) shows a more distinct structure, and the snout region is clearly more consistent. Compared to DreamEditor, the images are different but the image quality is comparable.
Figure 5: Influence of the proxy geometry -- The synthetic cow is generated with three different proxy meshes with the prompt "A brown cow". From left to right: High-poly proxy mesh, low-poly proxy mesh, and simple geometric primitives.
...and 5 more figures
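
The grid mechanics described in the Figure 1 and Figure 3 captions can be illustrated with a minimal NumPy sketch. It only covers the image bookkeeping: tiling the reference renders into a sheet, writing a new camera's render into the empty slot, and building the inpainting mask for that slot. The function names (`make_grid`, `place_in_slot`, `slot_mask`), the 2x3 layout, and the choice to mask the whole slot are assumptions made for illustration, not the authors' implementation; the diffusion call itself is omitted.

```python
import numpy as np

def make_grid(images, rows, cols):
    """Tile equally sized HxWxC images into a rows x cols sheet.

    Slots without an image stay zero (the "empty slot").
    """
    h, w, c = images[0].shape
    sheet = np.zeros((rows * h, cols * w, c), dtype=images[0].dtype)
    for i, img in enumerate(images):
        r, col = divmod(i, cols)
        sheet[r * h:(r + 1) * h, col * w:(col + 1) * w] = img
    return sheet

def place_in_slot(sheet, image, cols, slot):
    """Return a copy of the sheet with `image` written into slot index `slot`."""
    h, w = image.shape[:2]
    r, col = divmod(slot, cols)
    out = sheet.copy()
    out[r * h:(r + 1) * h, col * w:(col + 1) * w] = image
    return out

def slot_mask(rows, cols, h, w, slot):
    """Binary mask that is 1 inside the given slot and 0 elsewhere.

    Assumption: the whole slot is masked. In practice the rendered
    selection mask would further restrict the edited region.
    """
    mask = np.zeros((rows * h, cols * w), dtype=np.uint8)
    r, col = divmod(slot, cols)
    mask[r * h:(r + 1) * h, col * w:(col + 1) * w] = 1
    return mask

# Hypothetical setup: 5 reference views in a 2x3 grid, slot 5 left empty.
H, W = 4, 4
refs = [np.full((H, W, 3), i + 1, dtype=np.uint8) for i in range(5)]
reference_sheet = make_grid(refs, rows=2, cols=3)

# For each remaining camera, its render goes into the empty slot and only
# that slot is exposed to inpainting; the reference slots stay fixed.
new_view = np.full((H, W, 3), 9, dtype=np.uint8)
sheet_for_camera = place_in_slot(reference_sheet, new_view, cols=3, slot=5)
inpaint_mask = slot_mask(rows=2, cols=3, h=H, w=W, slot=5)
```

The same tiling would be applied to the color, depth, and mask renders so that all three conditioning inputs share one layout. Keeping the reference slots outside the inpainting mask is what ties each newly generated view to the fixed reference sheet.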
