DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions

Chenghao Gu, Zhenzhe Li, Zhengqi Zhang, Yunpeng Bai, Shuzhao Xie, Zhi Wang

TL;DR

DragScene addresses drag-based 3D scene editing by combining 2D drag edits on a reference view with cross-view propagation grounded in coarse 3D clues. The method performs latent optimization on the reference view, reconstructs 3D clues as a point cloud, propagates the edit to other views through multi-view latent optimization, and then reconstructs the final 3D scene from the edited multi-view images. The approach demonstrates precise, creative edits with strong multi-view consistency across real-world scenes and diverse 3D representations, outperforming prompt-based baselines and naive extensions of 2D drag editing to 3D. By decoupling 2D drag edits from 3D geometry through 3D clues and latent maps, DragScene offers a practical, extensible framework for interactive 3D editing, with language guidance and dynamic scene editing as possible future extensions.

Abstract

3D editing has shown remarkable capability in editing scenes based on various instructions. However, existing methods struggle with achieving intuitive, localized editing, such as selectively making flowers blossom. Drag-style editing has shown exceptional capability to edit images with direct manipulation instead of ambiguous text commands. Nevertheless, extending drag-based editing to 3D scenes presents substantial challenges due to multi-view inconsistency. To this end, we introduce DragScene, a framework that integrates drag-style editing with diverse 3D representations. First, latent optimization is performed on a reference view to generate 2D edits based on user instructions. Subsequently, coarse 3D clues are reconstructed from the reference view using a point-based representation to capture the geometric details of the edits. The latent representation of the edited view is then mapped to these 3D clues, guiding the latent optimization of other views. This process ensures that edits are propagated seamlessly across multiple views, maintaining multi-view consistency. Finally, the target 3D scene is reconstructed from the edited multi-view images. Extensive experiments demonstrate that DragScene facilitates precise and flexible drag-style editing of 3D scenes, supporting broad applicability across diverse 3D representations.
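The propagation step described above (reference latents attached to coarse 3D points, then used to guide other views) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one 3D point per reference latent location, a pinhole camera with intrinsics `K` expressed at latent resolution, and world-to-camera extrinsics `R`, `t`. The function name `project_latents` and all parameter names are hypothetical.

```python
import numpy as np

def project_latents(ref_latent, points_world, K, R, t, out_hw):
    """Splat reference-view latents into a target view (illustrative sketch).

    ref_latent:   (C, h, w) latent of the edited reference view
    points_world: (h, w, 3) assumed 3D point for each latent location
    K, R, t:      assumed pinhole intrinsics and world-to-camera pose of the target view
    out_hw:       (H, W) size of the target latent map
    Returns the projected latent map (C, H, W) and a boolean validity mask (H, W).
    """
    C = ref_latent.shape[0]
    H, W = out_hw
    feats = ref_latent.reshape(C, -1).T           # (N, C), one feature per point
    pts = points_world.reshape(-1, 3)             # (N, 3)

    cam = pts @ R.T + t                           # world -> target camera frame
    front = cam[:, 2] > 1e-6                      # keep points in front of the camera
    cam, feats = cam[front], feats[front]

    uvw = cam @ K.T                               # perspective projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    flat = v[inside] * W + u[inside]              # flattened target pixel index
    depth = cam[inside, 2]
    feats = feats[inside]

    # Z-buffer: sort by pixel, then by depth, and keep the nearest point per pixel.
    order = np.lexsort((depth, flat))
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    latent_flat = np.zeros((C, H * W), dtype=ref_latent.dtype)
    mask_flat = np.zeros(H * W, dtype=bool)
    latent_flat[:, flat[keep]] = feats[keep].T
    mask_flat[flat[keep]] = True
    return latent_flat.reshape(C, H, W), mask_flat.reshape(H, W)
```

The returned mask marks target locations that received a reference latent. Locations outside the mask have no 3D clue and would have to be handled by the target view's own latent.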

Paper Structure

This paper contains 17 sections, 7 equations, 9 figures.

Figures (9)

  • Figure 1: Results of DragScene. DragScene successfully enables drag-based editing for 3D scenes. By following user-provided editing instructions (masks and points), our model seamlessly performs drag-style editing on the original 3D scene. All the results presented above are based on 3D Gaussian Splatting (3DGS), demonstrating natural, creative, and view-consistent edits.
  • Figure 2: Our Motivation. Comparison of DragScene with directly applying DragDiffusion to multi-view images. (a) Existing 3D editing methods fail on this editing task, whereas DragScene performs well. (b) PCA visualization of U-Net feature maps during the diffusion process, showing that directly applying 2D drag-style methods to multi-view images produces inconsistent features, whereas DragScene maintains multi-view feature consistency throughout the diffusion process.
  • Figure 3: Overview of DragScene. Our approach consists of three steps. First, we apply a 2D drag-based diffusion model to edit the reference image and obtain the reference latent representation through DDIM inversion. Second, we perform consistent reconstruction of the reference latent representation to obtain 3D latent maps. Finally, we apply the inversion process to other views, further optimizing the images in latent space with the reconstructed 3D latent maps.
  • Figure 4: Consistent Reconstruction of Latent Representations. To facilitate consistent multi-view latent optimization, we apply DUSt3R to reconstruct the coarse 3D point cloud, with aligned masks assisting in the optimization process. The latent representation of the reference image is assigned to the point cloud to obtain the 3D latent maps.
  • Figure 5: More results of DragScene. We present various views of both the original and edited scenes. All scenes are reconstructed using 3D Gaussian Splatting.
  • ...and 4 more figures
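
The third step in the Figure 3 caption, optimizing other views in latent space with the 3D latent maps, can be pictured with a toy example. The summary above does not state the objective used in the paper, so the masked squared error below is an assumption chosen purely for illustration. `guide_latent`, `lr`, and `steps` are hypothetical names, and no diffusion model is involved.

```python
import numpy as np

def guide_latent(latent, guide, mask, lr=0.25, steps=30):
    """Pull a view's latent toward a projected latent map inside a mask.

    latent: (C, H, W) latent of the view being edited
    guide:  (C, H, W) latent map projected from the reference view
    mask:   (H, W) boolean, True where the guide is valid
    Assumed objective (not from the paper): sum over masked entries of (z - guide)^2.
    """
    z = latent.astype(np.float64).copy()
    m = mask.astype(np.float64)[None]        # (1, H, W), broadcast over channels
    for _ in range(steps):
        grad = 2.0 * m * (z - guide)         # gradient of the masked squared error
        z -= lr * grad                       # plain gradient-descent update
    return z
```

Under this assumed objective the masked region converges to the guide while everything outside the mask is left untouched, which is the behavior a localized edit needs.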