Fast Multi-view Consistent 3D Editing with Video Priors

Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang

TL;DR

ViP3DE brings video priors into text-driven 3D editing to enforce multi-view consistency in a single forward pass. A pre-trained video generation model is conditioned on a single edited frame and, using motion-preserved noise blending and geometry-aware denoising during diffusion, generates the remaining edited views at predefined camera poses; these views then update the 3D Gaussian representation. The approach yields higher editing quality and faster editing than prior 3D- or video-based methods. Limitations remain for drastic geometric edits, pointing to future work on broader editing capabilities and stronger geometry-aware priors.

Abstract

Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
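
The over-smoothing argument above can be illustrated with a toy numerical sketch. This is not part of ViP3DE: it assumes, purely for illustration, that every per-view edit equals a shared base signal plus independent zero-mean detail, and that iterative 3D updating behaves like a plain pixel-wise average across views. The names `simulate_view_edit` and `average_views` are hypothetical.

```python
import random
import statistics

def simulate_view_edit(base, detail_std, rng):
    """One per-view edit: shared base plus view-specific detail (assumed independent)."""
    return [b + rng.gauss(0.0, detail_std) for b in base]

def average_views(views):
    """Toy stand-in for iterative 2D-3D-2D updating: pixel-wise mean of all edits."""
    n = len(views)
    return [sum(pixel_values) / n for pixel_values in zip(*views)]

rng = random.Random(0)
base = [0.5] * 2000   # flat base texture, so all variation counts as "detail"
detail_std = 0.2      # assumed strength of per-view detail
num_views = 16        # assumed number of independently edited views

views = [simulate_view_edit(base, detail_std, rng) for _ in range(num_views)]
fused = average_views(views)

single_view_detail = statistics.pstdev(views[0])
fused_detail = statistics.pstdev(fused)
print(f"detail std, single view: {single_view_detail:.3f}")
print(f"detail std, averaged:    {fused_detail:.3f}")
```

Under these assumptions, the detail that survives averaging shrinks roughly with the square root of the number of views. This is the intuition behind generating all edited views consistently in one pass instead of reconciling independently edited views.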

Paper Structure

This paper contains 15 sections, 9 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Motivation of ViP3DE. (a) Most existing studies [instructn2n, shap-editor, clip-nerf, genn2n, dge, syncnoise, consistdreamer] employ pre-trained 2D models to iteratively update 3D assets, suffering from slow convergence and over-smoothed textures. (b) ViP3DE integrates video priors and source 3D priors to achieve multi-view consistent editing with a single pass.
  • Figure 2: Workflow of ViP3DE. First, contiguous multi-view images are rendered from the source 3D representation to form the source video. Then, the first frame is edited with InstructPix2Pix and used as the condition of the video model, and the initial noise of the diffusion process is obtained by motion-preserved noise blending. Next, geometric priors extracted from the source 3D representation are introduced during the video denoising process to improve 3D consistency across views; this step is termed geometry-aware denoising. Finally, the edited multi-view images are used to update the source 3D representation. Thanks to video priors, ViP3DE achieves fast and multi-view consistent 3D editing in a single forward pass.
  • Figure 3: Demonstration of motion-preserved noise blending. Appearance and pose alignment exhibit different levels of robustness to noise.
  • Figure 4: Qualitative comparison. We highlight edited results that suffer from inconsistency (red boxes), poor details (yellow boxes), and unfaithfulness (blue boxes). In comparison, ViP3DE obtains consistent results with higher faithfulness to the instruction. It also preserves rich details by avoiding the multiple iterations that typically cause over-smoothed textures.
  • Figure 5: The editing results with CLIP temporal score in a single forward pass. ViP3DE achieves better consistency.
  • ...and 11 more figures
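
The excerpt does not define the CLIP temporal score reported in Figure 5. A common formulation of such a consistency metric, assumed here, is the mean cosine similarity between image embeddings of consecutive frames. The sketch below takes precomputed embeddings as plain lists of floats; the CLIP encoder itself is not called, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def temporal_consistency(frame_embeddings):
    """Mean cosine similarity between consecutive frame embeddings (assumed metric)."""
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)

# Toy embeddings standing in for per-frame image features.
consistent = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
inconsistent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

score_consistent = temporal_consistency(consistent)
score_inconsistent = temporal_consistency(inconsistent)
print(f"smoothly varying frames: {score_consistent:.3f}")
print(f"unrelated frames:        {score_inconsistent:.3f}")
```

A higher score means neighbouring views look more alike in embedding space. On its own it does not measure faithfulness to the editing instruction.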