DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions

Chenghao Gu; Zhenzhe Li; Zhengqi Zhang; Yunpeng Bai; Shuzhao Xie; Zhi Wang

TL;DR

DragScene tackles drag-based 3D scene editing by combining 2D drag edits on a reference view with a cross-view propagation mechanism grounded in coarse 3D clues. The method performs latent optimization on the reference view, reconstructs coarse 3D clues from that view with a point-cloud representation, and propagates the edit to the other views through multi-view latent optimization. The final scene is then reconstructed with a diffusion-friendly 3D representation. The approach demonstrates precise, creative edits with strong multi-view consistency across real-world scenes and diverse 3D representations, outperforming prompt-based baselines and naive extensions of 2D drag editing to 3D. By decoupling 2D drag edits from 3D geometry through 3D clues and latent maps, DragScene offers a practical, extensible framework for interactive 3D editing, with potential future integration of language guidance and dynamic scene editing.

Abstract

3D editing has shown remarkable capability in editing scenes based on various instructions. However, existing methods struggle with achieving intuitive, localized editing, such as selectively making flowers blossom. Drag-style editing has shown exceptional capability to edit images with direct manipulation instead of ambiguous text commands. Nevertheless, extending drag-based editing to 3D scenes presents substantial challenges due to multi-view inconsistency. To this end, we introduce DragScene, a framework that integrates drag-style editing with diverse 3D representations. First, latent optimization is performed on a reference view to generate 2D edits based on user instructions. Subsequently, coarse 3D clues are reconstructed from the reference view using a point-based representation to capture the geometric details of the edits. The latent representation of the edited view is then mapped to these 3D clues, guiding the latent optimization of other views. This process ensures that edits are propagated seamlessly across multiple views, maintaining multi-view consistency. Finally, the target 3D scene is reconstructed from the edited multi-view images. Extensive experiments demonstrate that DragScene facilitates precise and flexible drag-style editing of 3D scenes, supporting broad applicability across diverse 3D representations.
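The propagation step described above (mapping the edited view's latent representation onto coarse 3D clues and using it to guide other views) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `unproject` and `project_to_view`, the scalar "latent" values, the pinhole camera with a pure sideways translation, and the tiny 2x2 grid are all assumptions made for illustration. The sketch lifts each reference-view pixel to a 3D point carrying its latent value, then reprojects the points into a second camera with a z-buffer.

```python
from typing import Dict, List, Tuple

Point = Tuple[Tuple[float, float, float], float]  # ((x, y, z), latent value)


def unproject(depth: List[List[float]], latent: List[List[float]],
              focal: float, cx: float, cy: float) -> List[Point]:
    """Lift every reference-view pixel to a 3D point that carries its latent value."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / focal
            y = (v - cy) * z / focal
            points.append(((x, y, z), latent[v][u]))
    return points


def project_to_view(points: List[Point], shift_x: float, focal: float,
                    cx: float, cy: float, width: int, height: int
                    ) -> Dict[Tuple[int, int], float]:
    """Project latent-carrying points into a camera translated by shift_x along x.

    Assumption: the target camera differs from the reference camera only by
    this translation. A z-buffer keeps the nearest point per pixel.
    """
    latent_map: Dict[Tuple[int, int], float] = {}
    zbuf: Dict[Tuple[int, int], float] = {}
    for (x, y, z), value in points:
        if z <= 0:
            continue  # point is behind the camera
        u = round(focal * (x - shift_x) / z + cx)
        v = round(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            if (u, v) not in zbuf or z < zbuf[(u, v)]:
                zbuf[(u, v)] = z
                latent_map[(u, v)] = value
    return latent_map


# Toy reference view: flat depth of 2.0 and one scalar latent per pixel.
depth = [[2.0, 2.0], [2.0, 2.0]]
latent = [[1.0, 2.0], [3.0, 4.0]]
ref_points = unproject(depth, latent, focal=2.0, cx=0.0, cy=0.0)

# Reprojecting into the reference camera itself recovers the full latent map.
same_view = project_to_view(ref_points, 0.0, 2.0, 0.0, 0.0, 2, 2)
# A camera shifted by one unit sees only part of the edited content.
shifted_view = project_to_view(ref_points, 1.0, 2.0, 0.0, 0.0, 2, 2)
```

In the shifted view only the pixel column at u = 0 receives latent values; the column at u = 1 is left empty. Holes of this kind suggest why a coarse projected latent map would serve as guidance for a further per-view optimization rather than as the final edited view.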
