C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing

Zeng Tao, Zheng Ding, Zeyuan Chen, Xiang Zhang, Leizhi Li, Zhuowen Tu

TL;DR

C3Editor tackles view-inconsistency in 2D priors used for 3D editing by selecting a ground-truth (GT) view and GT-edited image to steer a view-consistent 2D editing model. It introduces a two-phase, GT-guided optimization with intra-GT prior fitting (via LoRA_gt) and progressive view propagation for inter-view consistency (via LoRA_mv), enabling controllable edits across all views. The method updates the 3D representation (Gaussian Splatting) using edited 2D views, achieving higher CLIP-Scores for text- and image-driven alignment and lower FID than baselines. Overall, C3Editor delivers more coherent 2D-to-3D edits with user-controllable directions, showing practical improvements for multi-view 3D editing tasks, though it requires scene-specific 2D models and leaves room for generalization to fully generic multi-view editing.

Abstract

Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
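The idea of keeping separate LoRA modules for GT view fitting and for multi-view consistency can be illustrated with a minimal sketch. It assumes, hypothetically, that both adapters add low-rank updates to the same frozen weight matrix; the abstract does not say how the modules are actually combined. The names `LoRA`, `effective_weight`, `lora_gt`, and `lora_mv` are illustrative only, and the "update" below is a hand-written stand-in for a training step, not the authors' procedure.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class LoRA:
    """Low-rank update delta_W = (alpha / rank) * B @ A, with B starting at zero."""

    def __init__(self, d_out, d_in, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # A gets small random values, B is zero, so the adapter starts as a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def delta(self):
        return [[self.scaling * v for v in row] for row in matmul(self.B, self.A)]


def effective_weight(w_frozen, adapters):
    """Frozen weight plus the sum of all adapter updates (assumed additive)."""
    w = [row[:] for row in w_frozen]  # copy, so the frozen weight is never mutated
    for adapter in adapters:
        w = add(w, adapter.delta())
    return w


# Frozen base weight of a hypothetical 2x3 linear layer.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
lora_gt = LoRA(2, 3, rank=1, seed=1)  # adapter meant for GT view fitting
lora_mv = LoRA(2, 3, rank=1, seed=2)  # adapter meant for multi-view consistency

w_before = effective_weight(W, [lora_gt, lora_mv])

# Stand-in for a GT-fitting step: only lora_gt's parameters change.
lora_gt.B = [[1.0], [0.0]]
w_after = effective_weight(W, [lora_gt, lora_mv])
```

Because each adapter owns its own `A` and `B`, changing `lora_gt` leaves both the frozen weight and `lora_mv` untouched, which is the property that makes targeted fine-tuning of each objective possible in this sketch.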

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: C$^{3}$Editor Method Pipeline. Given a 3D representation $\Phi$, a text prompt for editing $y$, and the original 2D editing model $\Theta_{O}$, our method adapts $\Theta_{O}$ into a model $\Theta_{C}$ that follows $y$ and ensures multi-view consistency, thereby achieving improved 3D editing results. Phase 1: Controllable optimization direction selection and manual editing in \ref{sec:optimizationdirection}. Phase 2: Intra-GT prior fitting in \ref{sec:priorfitting} to fit the GT information. Phase 3: View propagation and inter-view consistency construction in \ref{sec:viewpropagation}. Details of LoRA modules for separate fine-tuning are in \ref{sec:paralleltuning}.
  • Figure 2: Comparison of Qualitative Results. Compared to baseline methods, C$^{3}$Editor can generate view-consistent 2D images, avoiding inter-view conflicts (highlighted in blue) and erroneous 2D edits (highlighted in red), thereby achieving better 3D editing results.
  • Figure 3: Controllable Editing Results with Different GT Selections. In C$^{3}$Editor, users can decide the optimization direction by selecting the GT edited image they prefer.
  • Figure 4: Controllable Editing Results with Manual Editing. In C$^{3}$Editor, users can edit the GT manually and obtain the corresponding 2D and 3D editing results.
  • Figure 5: Ablation Study on View Propagation. View propagation helps obtain more view-consistent results than the GT view.
  • ...and 6 more figures