Native 3D Editing with Full Attention

Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen

TL;DR

This work targets instruction-guided 3D editing by removing the bottlenecks of optimization-based and 2D-lifted approaches through a native, feed-forward framework. It introduces a large-scale, multi-modal dataset spanning addition, deletion, and modification, with careful curation to ensure geometry and appearance consistency. A key innovation is 3D token concatenation, a parameter-efficient conditioning strategy that outperforms traditional cross-attention in guiding a 3D diffusion-based model, enabling high-fidelity edits in around 20 seconds. Empirical results demonstrate state-of-the-art performance in generation quality, 3D consistency, and instruction fidelity, outperforming existing multi-view editing methods and establishing a new benchmark for native 3D editing.

Abstract

Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
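
To make the contrast between the two conditioning strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes the source object and the object being denoised are each represented as a list of token vectors, uses a single attention head, and omits the learned projections and the text-instruction conditioning. All function and variable names are hypothetical. The parameter count at the end rests on a further assumption: that a dedicated cross-attention layer adds four d×d projection matrices per block, while token concatenation reuses the existing self-attention weights.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (simplification)."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def random_tokens(n, d, rng):
    """Stand-in for n latent tokens of width d (hypothetical data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def full_attention_with_concat(noisy_tokens, source_tokens):
    """Token concatenation: join both sequences, then run one self-attention.

    Every token attends to every other token, so the target tokens see the
    source tokens and each other in the same layer.
    """
    seq = noisy_tokens + source_tokens      # concatenate along the sequence axis
    out = attention(seq, seq, seq)
    return out[: len(noisy_tokens)]         # keep only the target positions


def cross_attention(noisy_tokens, source_tokens):
    """Cross-attention: target tokens query the source tokens only."""
    return attention(noisy_tokens, source_tokens, source_tokens)


def extra_cross_attention_params(d_model, n_blocks):
    """Extra weights under the stated assumption of four d x d matrices
    (query, key, value, output) per added cross-attention layer, no biases."""
    return 4 * d_model * d_model * n_blocks


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = random_tokens(4, 8, rng)    # tokens of the object being denoised
    source = random_tokens(6, 8, rng)   # tokens of the source object
    print(len(full_attention_with_concat(noisy, source)))
    print(len(cross_attention(noisy, source)))
    print(extra_cross_attention_params(d_model=8, n_blocks=2))
```

Both functions return one output vector per target token. The trade-off in this sketch is that concatenation lengthens the attended sequence, which raises attention cost, while cross-attention keeps the sequence short but requires its own projection weights.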

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1:
  • Figure 2: Overview of our proposed framework for native 3D editing. The pipeline manipulates 3D objects based on textual instructions, utilizing token concatenation as a parameter-efficient alternative to cross-attention, achieving superior editing performance without additional complexity.
  • Figure 3: Effectiveness of our method on deletion, addition and modification tasks. Our method facilitates precise and instruction-guided editing while maintaining the visual fidelity and structural coherence of the source object compared with other baselines.
  • Figure 4: Experimental results on the deletion, addition, and modification tasks. 'Source Object' and the instructions are the inputs; 'Ours' shows the outputs. The method handles diverse modifications precisely.
  • Figure 5: Ablation studies on conditioning and data refinement strategies. (a) A qualitative comparison of conditioning strategies. Our token-concatenation approach successfully performs precise edits while preserving object consistency, whereas the cross-attention method results in corrupted geometry. (b) An ablation on data sources for modification tasks. Our final model, trained on a curated dataset lifted by Hunyuan3D 2.1 ("Ours"), achieves higher fidelity than models trained on uncurated data from either TRELLIS or Hunyuan3D 2.1 alone.