Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Chen Liyi, Wang Pengfei, Zhang Guowen, Ma Zhiyuan, Zhang Lei

Abstract

Most instruction-driven 3D editing methods rely on 2D models to guide the explicit, iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design across 3D editing tasks, because explicitly manipulating 3D geometry requires task-dependent rules: for example, 3D appearance editing must retain the source 3D geometry, while 3D removal alters it. Second, the iterative optimization is highly time-consuming, often requiring thousands of 2D/3D update invocations. We present Omni-3DEdit, a unified, learning-based model that generalizes across various 3D editing tasks implicitly. One key challenge in achieving this goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline that synthesizes a relatively large number of high-quality paired multi-view editing samples. We then adapt the pre-trained generative model SEVA as our backbone by concatenating source-view latents with conditional tokens in sequence space. We propose a dual-stream LoRA module to disentangle different view cues, which substantially enhances the model's representation learning capability. Being learning-based, our model avoids time-consuming online optimization and completes various 3D editing tasks in one forward pass, reducing inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.
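
As a rough illustration of the two architectural ideas named in the abstract (concatenation in sequence space and a dual-stream LoRA), the sketch below routes each token of a concatenated sequence through one of two low-rank adapters on top of a shared frozen weight. This is not the paper's OmniNet: the class `DualStreamLoRALinear`, the helper `build_sequence`, the stream labels `"source"` and `"other"`, and the per-token routing rule are all hypothetical choices made for illustration, and the abstract does not specify how the streams are actually split.

```python
# Illustrative sketch only. All names, shapes, and the routing rule are
# assumptions; they are not taken from the Omni-3DEdit implementation.
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def build_sequence(source_tokens, other_tokens):
    """Concatenate two token groups along the sequence axis and tag each token."""
    tokens = source_tokens + other_tokens
    streams = ["source"] * len(source_tokens) + ["other"] * len(other_tokens)
    return tokens, streams


class DualStreamLoRALinear:
    """Frozen linear map W plus one low-rank update (B @ A) per stream."""

    def __init__(self, dim, rank, rng):
        self.W = rand_matrix(dim, dim, rng)  # stands in for a frozen backbone weight
        # Standard LoRA initialisation: A random, B zero, so the update starts at zero.
        self.adapters = {
            "source": (rand_matrix(rank, dim, rng), zeros(dim, rank)),
            "other": (rand_matrix(rank, dim, rng), zeros(dim, rank)),
        }

    def forward(self, tokens, streams):
        out = []
        for tok, stream in zip(tokens, streams):
            A, B = self.adapters[stream]        # pick the adapter for this token's stream
            base = matvec(self.W, tok)          # shared frozen path
            delta = matvec(B, matvec(A, tok))   # stream-specific low-rank path
            out.append([b + d for b, d in zip(base, delta)])
        return out


rng = random.Random(0)
layer = DualStreamLoRALinear(dim=4, rank=2, rng=rng)
source = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # toy source-view latents
other = [[0.0, 0.0, 1.0, 0.0]]                         # toy conditional token
tokens, streams = build_sequence(source, other)
outputs = layer.forward(tokens, streams)
```

Because each stream owns its own adapter, updating one adapter changes only the outputs of tokens tagged with that stream, while the frozen weight stays shared across the whole sequence.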

Paper Structure (11 sections, 2 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Motivation of Omni-3DEdit. (a) 3D editing via iterative 2D-3D-2D optimization with an explicit 3D representation lacks generality and is time-consuming. (b) 3D editing in latent space struggles to handle scene-level assets with arbitrary viewpoints. (c) Omni-3DEdit addresses these issues in multi-view space to perform fast, general, and consistent editing.
  • Figure 2: Overview of Omni-3DEdit. Given the instruction and multi-view images as inputs, we first employ Qwen-Image to obtain an edited reference image as the condition view. An OmniNet is then trained to map the editing cues from the condition view to the other views. The outputs of OmniNet are edited multi-view images, which can optionally be used to obtain the edited 3D asset.
  • Figure 3: Data Construction Pipeline. The original multi-view images are passed through a four-stage pipeline to obtain their paired multi-view counterparts after editing. The pipeline covers the tasks of 3D removal, addition, and appearance editing.
  • Figure 4: Qualitative comparisons to 3D removal methods. Compared to other methods, Omni-3DEdit not only removes the specified object completely but also presents rich details in the removed regions. We center-crop views to match the OmniNet resolution (white box).
  • Figure 5: Comparison of 3D appearance editing. The red box marks the reference view. White boxes are object masks provided as additional inputs to MVInpainter.
  • ...and 2 more figures