Generic 3D Diffusion Adapter Using Controlled Multi-View Editing
Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas
TL;DR
MVEdit addresses open-domain 3D object synthesis with a training-free 3D Adapter that lets off-the-shelf 2D diffusion models produce 3D-consistent multi-view images. It combines ancestral sampling with a two-pass conditioning strategy based on rendered views and ControlNets, maintaining 3D coherence while preserving visual quality, with an inference time of 2-5 minutes. A robust NeRF/mesh optimization framework (with RGBA, normal, and ray-entropy losses) and a progressive rendering schedule support reliable geometry and texture, and StableSSDNeRF provides fast text-to-3D initialization. The approach reports state-of-the-art results in image-to-3D and text-guided texture generation, applies to 3D synthesis, editing, and texture upscaling, and offers a practical, data-efficient route to open-domain 3D content creation.
Abstract
Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
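The sampling loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an Euler-style ancestral step over a hand-picked noise schedule, and every function name here (`denoise_2d`, `lift_to_3d`, `render`, `mvedit_like_sampling`) is hypothetical. The 2D diffusion model is replaced by a closed-form placeholder, the NeRF/mesh fit is replaced by a cross-view average, and ControlNet conditioning is replaced by a simple blend. Only the control flow is meant to carry over: denoise each view, lift to a shared representation, render, denoise again conditioned on the renders, then take an ancestral step.

```python
import numpy as np

def denoise_2d(x, sigma, cond=None):
    """Placeholder for a 2D diffusion model's clean-image prediction.

    Assumption: unit-Gaussian data, so the ideal denoiser is x / (1 + sigma^2).
    `cond` stands in for ControlNet conditioning on rendered views.
    """
    x0 = x / (1.0 + sigma ** 2)
    if cond is not None:
        x0 = 0.5 * x0 + 0.5 * cond  # hypothetical blend, not a real ControlNet
    return x0

def lift_to_3d(views):
    """Placeholder for fitting a NeRF/mesh: average the views into one 'scene'."""
    return views.mean(axis=0)

def render(scene, n_views):
    """Placeholder renderer: every camera sees the same shared 'scene'."""
    return np.broadcast_to(scene, (n_views,) + scene.shape).copy()

def mvedit_like_sampling(n_views=4, size=8,
                         sigmas=(10.0, 5.0, 2.0, 1.0, 0.5, 0.0), seed=0):
    rng = np.random.default_rng(seed)
    # Start all views from pure noise at the highest noise level.
    x = rng.standard_normal((n_views, size, size)) * sigmas[0]
    for s_from, s_to in zip(sigmas[:-1], sigmas[1:]):
        # Pass 1: unconditioned per-view prediction of the clean images.
        x0_first = denoise_2d(x, s_from)
        # 3D adapter stand-in: lift predictions to a shared scene, then render.
        rendered = render(lift_to_3d(x0_first), n_views)
        # Pass 2: per-view prediction conditioned on the rendered views.
        x0 = denoise_2d(x, s_from, cond=rendered)
        # Euler ancestral step: deterministic move to s_down, then re-noise by s_up.
        s_up = min(s_to, np.sqrt(s_to ** 2 * (s_from ** 2 - s_to ** 2) / s_from ** 2))
        s_down = np.sqrt(s_to ** 2 - s_up ** 2)
        d = (x - x0) / s_from
        x = x + d * (s_down - s_from)
        x = x + rng.standard_normal(x.shape) * s_up
    return x

views = mvedit_like_sampling()
print(views.shape)
```

In a real pipeline the three placeholders would be a pretrained 2D diffusion model, a NeRF or mesh optimization over the denoised views, and a differentiable renderer feeding ControlNets. The fresh noise injected at each step is what distinguishes ancestral sampling from a purely deterministic sampler.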
