EditP23: 3D Editing via Propagation of Image Prompts to Multi-View

Roi Bar-On, Dana Cohen-Bar, Daniel Cohen-Or

TL;DR

EditP23 tackles mask-free 3D editing by propagating a single 2D edit across a multi-view representation using a pre-trained diffusion backbone. It introduces an edit-aware denoising mechanism guided by an image pair (original and edited view) and employs correlated noise to isolate and propagate the edit while preserving object identity. The approach is training-free and feed-forward, delivering fast edits that maintain 3D consistency and outperform mask-free baselines in both quantitative metrics and user studies. The work demonstrates broad applicability across object categories and edit types, with ablations validating the core design choices and a reconstruction pipeline enabling final 3D assets.

Abstract

We present EditP23, a method for mask-free 3D editing that propagates 2D image edits to multi-view representations in a 3D-consistent manner. In contrast to traditional approaches that rely on text-based prompting or explicit spatial masks, EditP23 enables intuitive edits by conditioning on a pair of images: an original view and its user-edited counterpart. These image prompts are used to guide an edit-aware flow in the latent space of a pre-trained multi-view diffusion model, allowing the edit to be coherently propagated across views. Our method operates in a feed-forward manner, without optimization, and preserves the identity of the original object, in both structure and appearance. We demonstrate its effectiveness across a range of object categories and editing scenarios, achieving high fidelity to the source while requiring no manual masks.

Paper Structure

This paper contains 24 sections, 1 equation, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Overview of our edit-aware denoising mechanism at a single timestep. Top branch: The original source grid is fed to the multi-view diffusion model along with the source condition view to predict the velocity towards the source. Bottom branch: The current edited grid is conditioned on the target view to predict the velocity towards the target. The resulting delta isolates the edit and guides the subsequent update of the edited grid.
  • Figure 2: Comparison with a Naïve Baseline. We compare our method with the baseline on two examples: R2D2 (top) and a yellow LEGO car (bottom). The baseline conditions the multi-view diffusion model directly on the edited view. In contrast, our method uses edit-aware denoising to propagate the intended edit consistently across the entire object while preserving structure and appearance. Each example is shown in four columns: editing condition (source and target views), the rendered source object, our result, and the baseline. For each edit, we display two viewpoints. The baseline struggles to retain key semantic features, e.g., hallucinating geometry on R2D2, whereas our method applies the changes coherently and meaningfully, even in the generic LEGO case, without relying on masks or frontal supervision.
  • Figure 3: Qualitative Results of EditP23. This figure showcases results across diverse object categories. Each block compares a source object (top) with its edited version (bottom). The leftmost column displays the conditioning views (source and target) used to prompt the edit, while the remaining columns show novel views of the result. Our approach consistently applies the desired edit while preserving the object's structure and identity across all viewpoints.
  • Figure 4: Qualitative Comparison with Baseline 3D Editing Methods. The columns correspond to the requested edits ("with headphones", "with pagoda roof", "cartoonish"); each cell shows two canonical views of the edited object. Rows list the original input views and the results produced by Vox-E, MVEdit, Instant3Dit, and our method. Instant3Dit is a mask-based local editor and cannot perform a global style change such as the cartoonish car; its entry is therefore marked "N/A" in the last column.
  • Figure 5: Human Evaluation Study Results. EditP23 was compared with two baseline approaches in a two-alternative forced-choice study. Raters strongly preferred the edits produced by EditP23.
  • ...and 5 more figures
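
The edit-aware denoising step described in the Figure 1 caption can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes a rectified-flow style interpolation z_t = (1 - t) * x + t * noise, which the text above does not specify, and it replaces the pre-trained multi-view diffusion model with a hypothetical `toy_velocity` function whose "clean" output is just the condition view copied to every view. The names `edit_aware_step` and `propagate_edit` are likewise made up for illustration. What the sketch does show is the two-branch structure: both branches share the same noise sample (the correlated noise mentioned in the TL;DR), and only the difference of the two predicted velocities moves the edited grid.

```python
import numpy as np


def toy_velocity(z_t, t, cond):
    """Hypothetical stand-in for the multi-view diffusion model (assumption).

    Assumes z_t = (1 - t) * x + t * noise and a "model" whose clean grid x is
    `cond` broadcast to every view, so the velocity noise - x is (z_t - cond) / t.
    """
    return (z_t - cond) / t


def edit_aware_step(z_edit, x_src, src_view, tgt_view, t, t_next, velocity_fn, rng):
    """One edit-aware update of the edited multi-view grid (illustrative only)."""
    noise = rng.standard_normal(x_src.shape)
    # Top branch: noised source grid, conditioned on the source view.
    z_src_t = (1.0 - t) * x_src + t * noise
    v_src = velocity_fn(z_src_t, t, src_view)
    # Bottom branch: the current edited grid receives the same noise offset
    # (correlated noise) and is conditioned on the target view.
    z_tgt_t = z_edit + (z_src_t - x_src)
    v_tgt = velocity_fn(z_tgt_t, t, tgt_view)
    # The velocity difference isolates the edit and drives the update.
    delta = v_tgt - v_src
    return z_edit + (t_next - t) * delta


def propagate_edit(x_src, src_view, tgt_view, velocity_fn, num_steps=10, seed=0):
    """Integrate the edit-aware updates from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    z_edit = x_src.copy()  # the edited grid starts as the source grid
    for t, t_next in zip(ts[:-1], ts[1:]):
        z_edit = edit_aware_step(
            z_edit, x_src, src_view, tgt_view, t, t_next, velocity_fn, rng
        )
    return z_edit


# Toy usage: 6 views with 4 latent values each; the edit changes one value.
src_view = np.array([1.0, 0.0, -1.0, 2.0])
tgt_view = np.array([1.0, 3.0, -1.0, 2.0])
x_src = np.tile(src_view, (6, 1))
edited = propagate_edit(x_src, src_view, tgt_view, toy_velocity)
```

In this toy setting the noise cancels in the velocity difference, so the edited grid ends up matching the target view in every view, and when the source and target views are identical the difference vanishes and the grid is left unchanged. A real multi-view model would not behave this simply; the sketch only conveys why subtracting the source-branch prediction leaves unedited content in place.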