Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong
TL;DR
This paper tackles single-image 3D editing for open-domain images, where existing methods are limited by their reliance on synthetic multi-view training data. It introduces a tuning-free framework that uses pre-trained diffusion models as both appearance priors and geometry priors, forming a depth-assisted, three-phase loop: view synthesis, undistortion, and shape alignment. The approach achieves large viewpoint changes with high texture and shape consistency, using diffusion-based inpainting, LoRA adaptation, and DDIM inversion to refine geometry and appearance without training a specialized model. Experiments on a diverse benchmark show better qualitative and quantitative performance than prior methods, along with strong user preference, highlighting the method's practical potential for creative design and AR applications. The work demonstrates the value of diffusion priors for robust, open-domain 3D editing from a single image, bridging 2D diffusion generation and 3D-consistent manipulation.
Abstract
We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
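To make the role of the estimated depth map concrete, the following is a minimal sketch of how a single pixel could be moved to a novel view under an object rotation: lift the pixel to 3D with its depth, rotate it about a pivot, and project it back. It is not the authors' implementation. The pinhole camera model, the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the choice of a vertical rotation axis are all assumptions made for illustration. In the described method, regions left empty by such a warp would be filled by the diffusion model, which is not shown here.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with its depth to a 3D camera-space point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def rotate_y(point, angle_rad, pivot):
    """Rotate a 3D point about a vertical axis passing through `pivot`."""
    px, py, pz = pivot
    x, y, z = point[0] - px, point[1] - py, point[2] - pz
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    xr = c * x + s * z
    zr = -s * x + c * z
    return (xr + px, y + py, zr + pz)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def warp_pixel(u, v, depth, angle_rad, pivot, intrinsics):
    """Hypothetical helper: where pixel (u, v) lands after rotating the object."""
    fx, fy, cx, cy = intrinsics
    p = unproject(u, v, depth, fx, fy, cx, cy)
    p = rotate_y(p, angle_rad, pivot)
    return project(p, fx, fy, cx, cy)

# Assumed intrinsics for a 256x256 image and a pivot 2 units in front of the camera.
K = (500.0, 500.0, 128.0, 128.0)
PIVOT = (0.0, 0.0, 2.0)
print(warp_pixel(100, 80, 2.0, math.radians(30), PIVOT, K))
```

Because the warp depends entirely on the depth value, an error in the estimated depth moves pixels to the wrong place in the new view. This is the kind of shape misalignment that the abstract assigns to the diffusion model's "geometry critic" role.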
