Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas

TL;DR

MVEdit addresses open-domain 3D object synthesis with a training-free 3D Adapter that makes off-the-shelf 2D diffusion models produce 3D-consistent multi-view images. It combines ancestral sampling with a two-pass conditioning strategy, using rendered views and ControlNets, to maintain 3D coherence while preserving visual quality, with an inference time of 2-5 minutes. A robust NeRF/mesh optimization framework (with RGBA, normal, and ray-entropy losses) and a progressive rendering schedule underpin reliable geometry and texture, complemented by StableSSDNeRF for fast text-to-3D initialization. The approach yields state-of-the-art results in image-to-3D and text-guided texture generation, and applies across 3D synthesis, 3D-to-3D editing, and texture upscaling, offering a practical, data-efficient route to open-domain 3D content creation.

Abstract

Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.

Paper Structure (36 sections, 10 equations, 12 figures, 3 tables)

This paper contains 36 sections, 10 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Comparison among 3D-aware multi-view denoising architectures. Adding a skip connection around the 3D NeRF in (c) mitigates the potential blurriness issue in (b), but requires two 2D UNet passes within the same denoising timestep when extending the off-the-shelf 2D Stable Diffusion; our simplified architecture in (d) re-uses the denoised multi-view images from the last denoising timestep to reconstruct the 3D NeRF.
  • Figure 2: Comparison between the two architectures, based on the text-guided 3D-to-3D pipeline with $t^\text{start}=0.78T$. Rendered RGB images $x^\text{rend}_\text{RGB}$ across different timesteps are shown to visualize the sampling process.
  • Figure 3: The initialization and ancestral sampling process of MVEdit. The original single-image SDEdit is shown in blue, the additional 3D Adapter in red, and extra conditioning in orange. For brevity, only the first view is depicted, and VAE encoding/decoding is omitted in cases involving latent diffusion.
  • Figure 4: Text-guided 3D-to-3D using the same seed but different $t^\text{start}$.
  • Figure 5: Architecture of StableSSDNeRF, consisting of a frozen Stable Diffusion UNet with LoRA fine-tuning, and a triplane latent decoder.
  • ...and 7 more figures
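
The sampling loop described in the abstract and in the captions of Figures 1(d) and 3 can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoise`, `toy_lift_and_render`, and `mvedit_like_sampling` are hypothetical names, the "3D representation" is simply the mean over views instead of an optimized NeRF or mesh, the denoiser is a closed-form placeholder instead of a Stable Diffusion UNet with ControlNets, and the linear noise schedule and Euler ancestral step are assumptions chosen for brevity. Only the control flow is meant to mirror the text: start SDEdit-style at $t^\text{start}$ (0.78T, as in Figure 2), denoise all views conditioned on renders from the previous timestep, lift the denoised views into a shared representation, and take an ancestral step.

```python
import numpy as np

def start_step(strength, num_steps):
    """SDEdit-style entry point: t_start = strength * T, as a step index."""
    return int(round(strength * num_steps))

def toy_denoise(x_t, sigma, rendered=None):
    """Placeholder for the 2D diffusion model (hypothetical, not a real UNet).

    Uses the posterior mean for unit-variance Gaussian data; if rendered views
    are given, blends them in as a crude stand-in for render conditioning.
    """
    x0 = x_t / (1.0 + sigma ** 2)
    if rendered is not None:
        x0 = 0.5 * x0 + 0.5 * rendered
    return x0

def toy_lift_and_render(denoised_views):
    """Placeholder for the 3D Adapter: fit one shared 'scene', render it back.

    The 'scene' here is just the mean over views (an assumption for the toy).
    """
    scene = denoised_views.mean(axis=0)
    rendered = np.broadcast_to(scene, denoised_views.shape).copy()
    return scene, rendered

def ancestral_step(x, denoised, sigma_from, sigma_to, rng):
    """One Euler ancestral step: deterministic move, then fresh noise."""
    sigma_up = min(sigma_to, np.sqrt(
        sigma_to ** 2 * (sigma_from ** 2 - sigma_to ** 2) / sigma_from ** 2))
    sigma_down = np.sqrt(max(sigma_to ** 2 - sigma_up ** 2, 0.0))
    d = (x - denoised) / sigma_from
    x = x + d * (sigma_down - sigma_from)
    return x + rng.standard_normal(x.shape) * sigma_up

def mvedit_like_sampling(init_views, strength=0.78, num_steps=50,
                         sigma_max=10.0, seed=0):
    """Jointly denoise all views, re-using last-step renders as conditioning."""
    rng = np.random.default_rng(seed)
    t_start = start_step(strength, num_steps)
    # Assumed schedule: sigma is linear in t and reaches 0 at t = 0.
    sigmas = sigma_max * np.arange(t_start, -1, -1) / num_steps
    # Perturb the initial views to the noise level of t_start.
    x = init_views + rng.standard_normal(init_views.shape) * sigmas[0]
    scene, rendered = None, None
    for sigma_from, sigma_to in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoise(x, sigma_from, rendered)   # 2D pass, all views
        scene, rendered = toy_lift_and_render(denoised)   # update shared 3D
        x = ancestral_step(x, denoised, sigma_from, sigma_to, rng)
    return scene, x

# Usage: four mutually inconsistent 8x8 "views" as the starting point.
views = np.random.default_rng(1).standard_normal((4, 8, 8))
scene, final_views = mvedit_like_sampling(views)
```

A smaller `strength` enters the loop at a lower noise level and so keeps more of the initial views, which is the knob varied in Figure 4.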