Towards Scalable and Consistent 3D Editing

Ruihao Xia, Yang Tang, Pan Zhou

TL;DR

This work tackles the bottlenecks of 3D editing by introducing 3DEditVerse, the largest paired 3D editing dataset with 116,309 training pairs and 1,500 test pairs, and 3DEditFormer, a 3D-structure-preserving conditional transformer. 3DEditFormer leverages dual-guidance attention, multi-stage feature extraction, and time-adaptive gating to disentangle editable regions from preserved structure, enabling precise, localized edits without requiring auxiliary 3D masks. Across comprehensive experiments, the approach achieves state-of-the-art performance on both 3D and 2D evaluation metrics, demonstrating superior fidelity and consistency while offering a practical, mask-free editing pipeline. The dataset and code are slated for release, accelerating research and practical deployment in scalable 3D editing for content creation and AR/VR applications.

Abstract

3D editing, the task of locally modifying the geometry or appearance of a 3D asset, has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on accurate, manually specified 3D masks, which are error-prone and impractical to produce. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

Paper Structure

This paper contains 25 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Some examples of our 3DEditVerse dataset. See more examples in Appendix \ref{appendix:dataset_vis}.
  • Figure 2: Overview of our data generation pipeline for text-guided 3D editing. Starting from a large-scale Vocabulary Set, we employ multiple foundation models in a carefully orchestrated manner and construct the text-to-image-to-3D lifting pipeline.
  • Figure 3: Overview of our proposed 3DEditFormer. (a) Multi-stage features $\{f^{(1,i)}_{3D}\}_{i=1}^N$ and $\{f^{(2,i)}_{3D}\}_{i=1}^N$ are extracted from the frozen Trellis model at different denoising timesteps, capturing fine-grained structural priors and semantic transition cues, respectively. (b) These features are injected into each transformer layer via (c) the Dual-Guidance Attention Block, where their contributions are modulated by (d) the Time-Adaptive Gating mechanism.
  • Figure 4: Qualitative comparison between our proposed 3DEditFormer and state-of-the-art methods, including EditP23, Instant3dit, and VoxHammer, on our proposed 3DEditVerse test set. More visualizations are provided in Appendices \ref{appendix:compare_vis} and \ref{appendix:mixamo_pred}.
  • Figure 5: More examples of (a) Character–Animation Compositions and (b) generative data from text-guided editing in our proposed 3DEditVerse dataset.
  • ...and 3 more figures
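
The Figure 3 caption describes two feature streams from a frozen model being injected into each transformer layer, with their contributions modulated by a timestep-dependent gate. The following is a minimal NumPy sketch of that general idea only, not the authors' implementation. Everything concrete in it is an assumption for illustration: the function names (`cross_attention`, `time_gate`, `dual_guidance_block`), the projection-free single-head attention, the sigmoid-of-linear gate, and the residual sum are all hypothetical, since the summary above does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def time_gate(t, w, b):
    # Hypothetical scalar gate in (0, 1) computed from a timestep t in [0, 1].
    # The sigmoid-of-linear form is an assumption, not taken from the paper.
    return 1.0 / (1.0 + np.exp(-(w * t + b)))

def dual_guidance_block(x, f_struct, f_sem, t, gate_params):
    # x:        (n_tokens, d) tokens of the asset being edited
    # f_struct: (m, d) features standing in for the structural priors f^(1,i)
    # f_sem:    (m, d) features standing in for the semantic cues f^(2,i)
    (w1, b1), (w2, b2) = gate_params
    g1 = time_gate(t, w1, b1)
    g2 = time_gate(t, w2, b2)
    # Residual update: each guidance stream is scaled by its own gate.
    out = x + g1 * cross_attention(x, f_struct) + g2 * cross_attention(x, f_sem)
    return out, (g1, g2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
f_struct = rng.normal(size=(6, 8))
f_sem = rng.normal(size=(6, 8))
out, gates = dual_guidance_block(
    x, f_struct, f_sem, t=0.5, gate_params=((4.0, -2.0), (-4.0, 2.0))
)
print(out.shape)  # (4, 8)
```

With the illustrative parameters above, the first gate grows with the timestep and the second shrinks, so the relative weight of the two streams shifts over the denoising trajectory. Which stream should dominate at which timestep is a design decision that this sketch does not attempt to settle.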