Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong

TL;DR

This paper tackles single-image 3D editing for open-domain images, addressing the limitation that existing methods rely on synthetic multi-view data. It introduces a tuning-free framework that leverages pre-trained diffusion models as both appearance priors and geometry priors, forming a depth-assisted, three-phase loop: view synthesis, undistortion, and shape alignment. The approach achieves large viewpoint changes with high texture and shape consistency, using diffusion-based inpainting, LoRA adaptation, and DDIM inversion to refine geometry and appearance without additional training. Experimental results on a diverse benchmark show superior qualitative and quantitative performance against prior methods, with strong user preferences, highlighting the method's practical potential for creative design and AR applications. The work demonstrates the value of diffusion priors in enabling robust, open-domain 3D editing from a single image, bridging 2D diffusion generation with 3D-consistent manipulations.

Abstract

We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
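
To make the phrase "predicting novel views of the selected object using estimated depth maps" more concrete, here is a minimal sketch of the standard depth-based reprojection that such a step typically builds on. It is not the authors' implementation: the function name `warp_pixels_with_depth`, the shared pinhole intrinsics `K`, and the rigid transform `(R, t)` are all illustrative assumptions. The diffusion-based inpainting, undistortion, and shape alignment described above are not shown.

```python
import numpy as np

def warp_pixels_with_depth(depth, K, R, t):
    """Reproject every source pixel into a target view (hypothetical helper).

    depth : (H, W) depth along the camera z-axis, e.g. from a monocular estimator.
    K     : (3, 3) pinhole intrinsics, assumed identical for both views.
    R, t  : (3, 3) rotation and (3,) translation mapping source-camera
            coordinates to target-camera coordinates.
    Returns target pixel coordinates (H, W, 2), target depths (H, W),
    and a boolean mask of points that land in front of the target camera.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project: rays at z = 1, then scale by depth to get 3D points.
    rays = pix @ np.linalg.inv(K).T
    pts_src = rays * depth[..., None]

    # Apply the rigid edit (rotation and translation) in camera space.
    pts_tgt = pts_src @ R.T + t

    # Project back to the image plane with the same intrinsics.
    proj = pts_tgt @ K.T
    z = proj[..., 2]
    valid = z > 1e-8
    uv = proj[..., :2] / np.maximum(z, 1e-8)[..., None]
    return uv, z, valid
```

Warping with a single estimated depth map leaves holes at disocclusions and inherits any depth errors as shape distortions. Those are the gaps that the inpainting, undistortion, and shape alignment stages are described as addressing.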

Paper Structure (39 sections, 3 equations, 11 figures, 2 tables, 1 algorithm)

This paper contains 39 sections, 3 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: 3D-aware image editing results of our proposed method. Our method enables 3D manipulations of objects with consistent appearance, plausible layout, and harmonious composition including occlusion (e.g., the first two examples of translation editing), by using pre-trained diffusion models. (Best viewed with zoom-in)
  • Figure 2: The overall pipeline. Our 3D-aware editing method iterates among three phases. The view synthesis phase generates the novel view of the selected object using depth-based warping and layered diffusion inpainting (the initial depth map is obtained by monocular depth estimation). The undistortion phase rectifies potential distortions in the target-view image induced by an inferior depth estimate. The shape alignment phase aligns the object shape in the original input image to the undistorted target image by optimizing the depth map and minimizing dense image correspondences. After several iterations, this process yields plausible and consistent 3D editing results.
  • Figure 3: Illustration of our view synthesis stage with a layered, diffusion-based generative inpainting scheme.
  • Figure 4: Intermediate results of our iterative algorithm. (Best viewed with zoom-in; see the text in Sec. \ref{sec:visual} for detailed explanations.)
  • Figure 5: Visual comparison of different methods. The bottom-right figures in the first row depict the initial and target poses for editing. The results of Zero123 and Stable Zero123 are postprocessed by overlaying the generated objects onto the inpainted background. (Best viewed with zoom-in; see the suppl. material for more results.)
  • ...and 6 more figures