ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Jun-Kun Chen, Samuel Rota Bulò, Norman Müller, Lorenzo Porzi, Peter Kontschieder, Yu-Xiong Wang

TL;DR

This paper proposes ConsistDreamer, a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, enabling high-fidelity instruction-guided scene editing. It is the first work capable of successfully editing complex patterns.

Abstract

This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at immortalco.github.io/ConsistDreamer.
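The idea of 3D-consistent, structured noise can be illustrated with a toy sketch. It is not the authors' implementation: it assumes that each pixel of a view can be mapped to the index of the surface point it sees (the hypothetical `point_ids` array), and that Gaussian noise is sampled once per surface point rather than once per image. Under those assumptions, pixels in different views that look at the same point receive the same noise value.

```python
import numpy as np

def structured_noise_for_view(point_ids, point_noise):
    """Look up per-surface-point noise for every pixel of one view.

    point_ids:   (H, W) integer array; index of the surface point seen by
                 each pixel (hypothetical; assumed to come from scene geometry).
    point_noise: (N, C) array of Gaussian noise, sampled once per point.
    Returns an (H, W, C) noise image for this view.
    """
    return point_noise[point_ids]

rng = np.random.default_rng(0)
num_points, channels = 64, 4
# Noise is attached to the surface, not to any particular image.
point_noise = rng.standard_normal((num_points, channels))

# Two toy 4x4 "views" of the same surface. View B sees the surface
# shifted by one point, so its pixel (0, 0) sees the same point as
# pixel (0, 1) of view A.
view_a_ids = np.arange(16).reshape(4, 4)
view_b_ids = view_a_ids + 1

noise_a = structured_noise_for_view(view_a_ids, point_noise)
noise_b = structured_noise_for_view(view_b_ids, point_noise)

# Corresponding pixels share noise; image-independent noise would not.
shared = np.array_equal(noise_a[0, 1], noise_b[0, 0])
```

Each per-view noise image is still drawn from a Gaussian, but the images are correlated across views through the shared surface points. A real pipeline would additionally have to handle occlusions and several pixels mapping to the same point, which this sketch ignores.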

Paper Structure (54 sections, 11 figures, 2 tables)

This paper contains 54 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Our ConsistDreamer lifts 2D diffusion with 3D awareness and consistency, achieving high-fidelity instruction-guided scene editing with superior sharpness and detailed textures. Left: The three synergistic components within ConsistDreamer that enable 3D consistency. Right: State-of-the-art performance of ConsistDreamer across various editing tasks and scenes, especially when prior work (e.g., IN2N [in2n]) fails and in challenging large-scale indoor scenes from ScanNet++ [scannetpp]. More results are on our project page: https://immortalco.github.io/ConsistDreamer/.
  • Figure 2: The ConsistDreamer framework is an IN2N-like [in2n] pipeline containing two major procedures. (a) In the NeRF fitting procedure, we continuously train NeRF with a buffer of edited views. (b) In diffusion generation and training, we add our 3D-consistent structured noise to rendered multi-view images, and compose surrounding views with them, as input to the augmented 3D-aware 2D diffusion model [ip2p]. We then add the edited images to the buffer for (a), and apply self-supervised consistency-enforcing training using the consistency-warped images. Note: images of structured noise are only for illustration; they are actually visually indistinguishable from Gaussian noise images.
  • Figure 3: Comparison in the Fangzhou scene shows that our ConsistDreamer produces significantly sharper editing results with more fine-grained textures and higher consistency with the instruction, e.g., Lord Voldemort with no hair, on which all baselines fail. The instructions are the texts at the bottom, except for NArt [nerfart], which uses the underlined texts. The baseline images are taken from their respective papers.
  • Figure 4: ConsistDreamer consistently generates high-quality and high-fidelity editing results, featuring detailed, fine-grained textures across various scenes and instructions. Notably, ConsistDreamer also maintains the high diversity from [ip2p], as exemplified by the highly diversified results in (a) and (b). Additional results and comparisons are provided in the supplementary and on our project page.
  • Figure B.0: Qualitative comparisons with baseline CSD on three tasks show that our ConsistDreamer achieves high-quality editing, outperforming both IN2N and CSD with more successful editing.
  • ...and 6 more figures
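
The "compose surrounding views" step mentioned in the Figure 2 caption can be pictured with a small sketch. The layout used here (main view on top, rows of downscaled surrounding views below) is an assumption made for illustration; the excerpt does not specify the actual arrangement, and the function name `compose_with_surrounding_views` and its `cols` parameter are hypothetical.

```python
import numpy as np

def compose_with_surrounding_views(main_view, surrounding_views, cols=4):
    """Place a main view above a grid of downscaled surrounding views.

    main_view:         (H, W, 3) array.
    surrounding_views: list of (H, W, 3) arrays; its length must be a
                       multiple of `cols`.
    The layout is an illustrative assumption, not the paper's format.
    """
    h, w, _ = main_view.shape
    if h % cols or w % cols or len(surrounding_views) % cols:
        raise ValueError("sizes and view count must be divisible by cols")

    # Naive downscaling by striding: each thumbnail is (H/cols, W/cols, 3).
    thumbs = [view[::cols, ::cols] for view in surrounding_views]

    # Each row holds `cols` thumbnails and spans the full width W.
    rows = [
        np.concatenate(thumbs[i:i + cols], axis=1)
        for i in range(0, len(thumbs), cols)
    ]
    return np.concatenate([main_view] + rows, axis=0)

rng = np.random.default_rng(0)
main = rng.random((8, 8, 3))
neighbors = [rng.random((8, 8, 3)) for _ in range(8)]

composed = compose_with_surrounding_views(main, neighbors, cols=4)
```

With an 8x8 main view and eight surrounding views, the composed image has the main view in its top 8 rows and two rows of 2x2 thumbnails below it. Packing everything into one image lets a single 2D diffusion pass see context from neighboring viewpoints alongside the view being edited.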