Learning Naturally Aggregated Appearance for Efficient 3D Editing

Ka Leong Cheng, Qiuyu Wang, Zifan Shi, Kecheng Zheng, Yinghao Xu, Hao Ouyang, Qifeng Chen, Yujun Shen

TL;DR

AGAP targets fast, interactive 3D editing. It represents scene geometry with an explicit density grid $\phi_G$ and aggregates scene appearance into an explicit 2D canonical image $\phi_I$, which is linked to 3D space by a projection field $P$. Edits are made directly on the canonical image with 2D tools and propagate to the 3D scene through $P$ without re-optimizing the underlying model, making each edit at least 20× faster than existing NeRF-based editing methods. A learnable projection offset $P_o$, paired with a pseudo canonical projection $P_c$, preserves natural appearance across view changes and occlusions, and progressive training with data-type-specific canonical projections supports forward-facing, panorama, object-centric, and 360° data. Experiments on diverse datasets show competitive fidelity and robust editing capabilities (stylization, segmentation, drawing) with markedly better interactivity.

Abstract

Neural radiance fields, which represent a 3D scene as a color field and a density field, have demonstrated great progress in novel view synthesis, yet they are ill-suited to editing because the representation is implicit. This work studies the task of efficient 3D editing, where we focus on editing speed and user interactivity. To this end, we propose to learn the color field as an explicit 2D appearance aggregation, also called the canonical image, with which users can easily customize their 3D editing via 2D image processing. We complement the canonical image with a projection field that maps 3D points onto 2D pixels for texture query. This field is initialized with a pseudo canonical camera model and optimized with offset regularity to ensure the naturalness of the canonical image. Extensive experiments on different datasets suggest that our representation, dubbed AGAP, supports various kinds of 3D editing (e.g., stylization, instance segmentation, and interactive drawing). Our approach is at least 20 times faster per edit than existing NeRF-based editing methods. The project page is available at https://felixcheng97.github.io/AGAP/.
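
The texture query described in the abstract (a 3D point is projected to a 2D location, and its color is read from the canonical image) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the pinhole form of the pseudo canonical projection, the zero-valued stand-in for the learned offset, bilinear interpolation as the lookup, and all function names.

```python
import math

def pseudo_canonical_projection(point, focal=1.0):
    """Assumed pinhole model: map a 3D point (x, y, z) with z > 0
    to normalized image coordinates (u, v), where (0.5, 0.5) is the center."""
    x, y, z = point
    return 0.5 + focal * x / (2.0 * z), 0.5 + focal * y / (2.0 * z)

def projection_offset(point, view_dir):
    """Hypothetical stand-in for the learned, view-dependent offset P_o.
    A trained model would predict this; the sketch returns zero."""
    return 0.0, 0.0

def bilinear_sample(image, u, v):
    """Read an RGB value from a row-major image at normalized (u, v)."""
    height, width = len(image), len(image[0])
    x = min(max(u, 0.0), 1.0) * (width - 1)   # clamp, then scale to pixels
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0

    def lerp(a, b, t):
        return tuple((1.0 - t) * ai + t * bi for ai, bi in zip(a, b))

    top = lerp(image[y0][x0], image[y0][x1], tx)
    bottom = lerp(image[y1][x0], image[y1][x1], tx)
    return lerp(top, bottom, ty)

def query_color(canonical_image, point, view_dir):
    """P = P_c + P_o, followed by a lookup in the canonical image."""
    u_c, v_c = pseudo_canonical_projection(point)
    du, dv = projection_offset(point, view_dir)
    return bilinear_sample(canonical_image, u_c + du, v_c + dv)

# A tiny 2x2 canonical image: red, green / blue, white.
canonical = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
             [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
center_color = query_color(canonical, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```

In this sketch, a 2D edit amounts to overwriting entries of `canonical`; every 3D point that projects onto the edited pixels then returns the new color, with no change to the projection functions.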

Paper Structure

This paper contains 16 sections, 14 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: The overall pipeline. AGAP consists of two components: (1) an explicit 3D density grid $\phi_G$ to estimate geometry for density $\sigma$; (2) an explicit canonical image $\phi_I$ with an associated view-dependent projection field $P$ to aggregate appearance for color $\mathbf{c}$. By performing 2D image processing on the canonical image, our method enables various kinds of editing (e.g., instance segmentation, interactive drawing, and scene stylization) through volume rendering without the need for re-optimization.
  • Figure 2: Visual comparison of novel-view scene stylization results on the IN2N and LLFF datasets given different text prompts or image references. Our method achieves stylization results on par with the baselines while requiring no time-consuming re-optimization. As highlighted in row two, our method better preserves color and texture consistency with the image reference.
  • Figure 3: By performing explicit edits on the canonical image $\phi_I$, our model propagates the editing effects through the learned projection field $P$ for efficient 3D editing.
  • Figure 4: Additional scene stylization results on the panorama Replica dataset given different text prompts.
  • Figure 5: Visual comparison of foreground and background segmentation results on the LLFF dataset.
  • ...and 7 more figures
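
The Figure 1 caption states that edits reach the rendered views through volume rendering, with density $\sigma$ coming from the grid and color $\mathbf{c}$ from the canonical image. The sketch below uses the standard NeRF-style compositing rule along a single ray to show why changing only the colors is enough. The sample values and the function name are hypothetical, and the paper's actual renderer may differ in detail.

```python
import math

def composite(sigmas, colors, deltas):
    """Standard volume rendering quadrature along one ray.

    sigmas: densities at the ray samples (from the density grid)
    colors: RGB tuples at the samples (looked up in the canonical image)
    deltas: distances between consecutive samples
    """
    transmittance = 1.0            # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel)

# Hypothetical ray: empty space, then an opaque surface, then hidden content.
sigmas = [0.0, 1e9, 5.0]
deltas = [1.0, 1.0, 1.0]
original_colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
edited_colors = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

before = composite(sigmas, original_colors, deltas)
after = composite(sigmas, edited_colors, deltas)
```

The densities and compositing weights are identical in both calls; only the color looked up at the visible surface changes, which is why an edit to the canonical image needs no re-optimization of geometry in this simplified picture.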