CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion

Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei

TL;DR

CoreEditor addresses the problem of inconsistent multi-view edits in text-driven 3D editing by introducing a correspondence-constrained attention mechanism that enforces precise cross-view interactions during diffusion. It combines a geometric plus semantic co-supported correspondence strategy with a Reference Attention pipeline to align global editing styles and maintain local consistency without fine-tuning the diffusion model. The key contributions are the Correspondence-constrained Attention (CCA), the geometric+semantic correspondence framework, and the selective editing pipeline enabling user-controlled, diverse, yet faithful edits. Empirical results across seven scenes demonstrate superior 3D consistency, sharper textures, and higher semantic fidelity compared to state-of-the-art baselines, while maintaining efficiency and zero-shot deployment. This work advances practical, high-quality 3D editing workflows for neural scene representations such as Gaussian Splatting.

Abstract

Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
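The correspondence-constrained attention described above can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity threshold, and the rule for combining geometric and semantic evidence (keeping a geometric candidate only when the token features are also similar) are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def co_supported_mask(geo_mask, feats, sim_threshold=0.5):
    """Combine geometric and semantic evidence into one boolean mask.

    geo_mask: (N, N) bool, True where token j is a geometric match of token i
              (e.g. obtained by reprojection; how it is built is out of scope).
    feats:    (N, D) per-token features used to estimate semantic similarity.
    ASSUMPTION: a pair is kept only if it is a geometric candidate AND its
    cosine similarity reaches sim_threshold. The paper may combine them
    differently.
    """
    normed = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    cos_sim = normed @ normed.T
    mask = geo_mask & (cos_sim >= sim_threshold)
    np.fill_diagonal(mask, True)  # a token can always attend to itself
    return mask


def correspondence_constrained_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to corresponding tokens.

    q, k, v: (N, D) queries, keys, values for tokens gathered from all views.
    mask:    (N, N) bool, True where attention is allowed.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block non-corresponding pairs
    return softmax(scores, axis=-1) @ v
```

In this sketch a token with no accepted correspondence attends only to itself, so its output equals its own value vector; tokens with accepted matches receive a weighted blend of the matched values.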

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Key features of our method and visual comparison with the recent GaussCtrl method. (a) Visual comparison of edited multi-view training images. CoreEditor integrates geometric and semantic correspondences into the T2I diffusion model, ensuring 3D-consistent edits. (b) Visual comparison of rendered edited results. With consistent multi-view images, CoreEditor generates results with sharper textures.
  • Figure 2: Overview of CoreEditor. Our method edits the rendered multi-view images ($\mathcal{I}$) into a consistent image set $\mathcal{I}^e$, which is then used to update the original GS model. The process ensures 3D consistency through two key steps: (1) Once the user selects a preferred edit, $I^r$, we integrate its pattern into the diffusion model using Reference Attention. (2) After the geometry and semantic co-supported correspondence set has been established, we inject it into the diffusion model by Correspondence-constrained Attention.
  • Figure 3: Differences in how Reference Attention (RA), self-attention (SA), and Correspondence-constrained Attention (CCA) are computed. Compared with the original SA, RA treats the selected edit as an additional set of keys and values. To improve local consistency, CCA restricts each image patch token to interact only with its corresponding patches in other views.
  • Figure 4: Visual comparison with state-of-the-art methods (including GaussianEditor, DGE, GaussCtrl, and EditSplat) in the "bear" and "stone horse" scenes. We provide results rendered from two views for each edited scene. Blurry regions are highlighted with yellow dashed boxes.
  • Figure 5: Visual comparison with state-of-the-art methods (including GaussianEditor, DGE, GaussCtrl, and EditSplat) in the "face", "garden", and "bicycle" scenes. We provide results rendered from two views for each edited scene. Blurry regions are highlighted with yellow dashed boxes.
  • ...and 7 more figures
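
The Figure 3 caption contrasts self-attention with Reference Attention, where the selected edit supplies an additional set of keys and values. A minimal NumPy sketch of that difference follows. It assumes single-head attention over flattened tokens, and the names `reference_attention`, `k_ref`, and `v_ref` are hypothetical; it is an illustration of the idea, not the paper's code.

```python
import numpy as np


def attention(q, k, v):
    """Plain scaled dot-product attention over (tokens, dim) arrays."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def self_attention(q, k, v):
    """SA: each view attends only to its own keys and values."""
    return attention(q, k, v)


def reference_attention(q, k, v, k_ref, v_ref):
    """RA (sketch): append the reference edit's keys/values to the view's own.

    k_ref, v_ref: (M, D) keys and values taken from the user-selected edit.
    ASSUMPTION: they are simply concatenated along the token axis, so the
    softmax is shared between the view's tokens and the reference tokens.
    """
    k_all = np.concatenate([k, k_ref], axis=0)
    v_all = np.concatenate([v, v_ref], axis=0)
    return attention(q, k_all, v_all)
```

Because the output shape matches that of self-attention, a layer written this way could be swapped in without changing the surrounding network, which is consistent with the source's statement that no fine-tuning of the diffusion model is required.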