ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Mohamed el amine Boudjoghra, Ivan Laptev, Angela Dai

TL;DR

ScanEdit addresses editing of complex real-world 3D scans by translating high-level natural language edits into tractable per-object actions via a hierarchically guided, multi-stage pipeline. It constructs a hierarchical scene graph, uses LLMs to identify relevant subgraphs and generate localized plans, and then performs hierarchical object placement followed by joint optimization that enforces both semantic intent and physical feasibility. The approach combines LLM-based scene constraints with differentiable 3D losses to avoid collisions and boundary violations while preserving overall scene structure. Experiments on ScanNet++ and Replica demonstrate state-of-the-art geometric plausibility and perceptual alignment with prompts, enabling scalable editing for scenes with hundreds of objects.
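The collision and boundary terms mentioned above can be illustrated with a small sketch. This section does not give the actual loss formulation, so everything below is an assumption made for illustration: objects are reduced to axis-aligned floor-plane footprints, the room is a rectangle, and the names `Box`, `collision_penalty`, `boundary_penalty`, and `total_penalty` are hypothetical. The penalties are piecewise differentiable, so in principle they could be minimized with a gradient-based optimizer in an autodiff framework.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned footprint of an object on the floor plane (hypothetical simplification)."""
    cx: float  # center x
    cy: float  # center y
    w: float   # extent along x
    d: float   # extent along y


def overlap_1d(c1, s1, c2, s2):
    """Length of the overlap between two intervals given by center and size."""
    return max(0.0, min(c1 + s1 / 2, c2 + s2 / 2) - max(c1 - s1 / 2, c2 - s2 / 2))


def collision_penalty(a, b):
    """Overlap area of two footprints; zero when they do not intersect."""
    return overlap_1d(a.cx, a.w, b.cx, b.w) * overlap_1d(a.cy, a.d, b.cy, b.d)


def boundary_penalty(box, room_w, room_d):
    """Total distance by which the box extends past the walls of a [0, room_w] x [0, room_d] room."""
    return (
        max(0.0, -(box.cx - box.w / 2))
        + max(0.0, box.cx + box.w / 2 - room_w)
        + max(0.0, -(box.cy - box.d / 2))
        + max(0.0, box.cy + box.d / 2 - room_d)
    )


def total_penalty(boxes, room_w, room_d):
    """Sum of boundary penalties plus all pairwise collision penalties."""
    penalty = sum(boundary_penalty(b, room_w, room_d) for b in boxes)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            penalty += collision_penalty(boxes[i], boxes[j])
    return penalty
```

A layout with no overlaps and no object outside the room yields a total penalty of zero, which is the condition a physically plausible arrangement should satisfy under these simplified assumptions.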

Abstract

With the fast pace of 3D capture technology and resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation ScanEdit outperforms state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions.
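To make the idea of a hierarchical scene graph concrete, the following is a minimal sketch of how commands might be applied hierarchically. It assumes that graph edges encode support relations (a table supports the vase resting on it) and that moving a node should carry along everything beneath it. The `Node` class and the `translate` and `flatten` helpers are hypothetical names and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One object in a hypothetical hierarchical scene graph."""
    name: str
    position: tuple  # (x, y, z) in scene coordinates
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child (an object this node supports) and return it."""
        self.children.append(child)
        return child


def translate(node, offset):
    """Move a node and, recursively, every object it supports."""
    node.position = tuple(p + o for p, o in zip(node.position, offset))
    for child in node.children:
        translate(child, offset)


def flatten(node, depth=0):
    """Depth-first listing of (name, depth), parents before children."""
    out = [(node.name, depth)]
    for child in node.children:
        out.extend(flatten(child, depth + 1))
    return out


# Example: moving the table also moves the vase standing on it.
room = Node("room", (0.0, 0.0, 0.0))
table = room.add(Node("table", (2.0, 1.0, 0.0)))
vase = table.add(Node("vase", (2.0, 1.0, 1.0)))
translate(table, (1.0, 0.0, 0.0))
```

Under this representation, a single command on a parent node stands in for many per-object edits, which is one way the hierarchy can keep editing tractable for scenes with many objects.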

Paper Structure

This paper contains 46 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: ScanEdit enables instruction-driven editing of complex, real-world scenes by rearranging their 3D scans. Given an input 3D scan and its object-level decomposition, we formulate a hierarchically-guided, multi-stage LLM-based approach that transforms high-level user instructions into concrete, tractable local instructions for objects, which can then be globally optimized to achieve the functional instruction. In this example, ScanEdit rearranges chairs, tables and a coffee machine to create a coffee drinking area.
  • Figure 2: Overview of ScanEdit. Given an input instruction $\mathcal{I}$ for a scene mesh $\mathcal{M}$ that has an instance decomposition and is reconstructed from RGB-D sequence $\mathcal{R}$, we output an edited scene according to the instruction $\mathcal{I}$. We first construct a hierarchical scene graph $\mathcal{G}$ using 3D and VLM reasoning to annotate graph node and edge attributes. Since this graph may be very large in size, we then identify the relevant subgraph $\mathcal{G}_s$ for $\mathcal{I}$. Our planner then breaks down the high-level instruction $\mathcal{I}$ into low-level object instructions, validates them, and creates an instruction queue. We traverse the instruction queue hierarchically in order to place objects as initialization for the new output scene, followed by a scene optimization over both LLM-generated scene constraints as well as physical 3D collision constraints, to produce the output edited scene. In this example, the two vases on the top of the bookshelf are moved to the table in the 3D scan. Note that since real-world 3D scans are partial, some holes can be visible (e.g., in the output wall) after re-arranging objects.
  • Figure 3: Qualitative comparison with baselines LayoutGPT [feng2024layoutgpt] and LayoutVLM [sun2024layoutvlm]. Red circles denote strong geometric errors (large collisions, out-of-boundary). Our method shows strong performance in adhering to the instruction while achieving physical plausibility.
  • Figure 4: Ablation visualization over loss components. Our final loss with all components produces physically plausible results that avoid collisions and out-of-boundary objects.
  • Figure 5: Our perceptual study shows that users strongly prefer our method over the baselines LayoutGPT [feng2024layoutgpt] and LayoutVLM [sun2024layoutvlm], in both adherence to the text instruction (AT) and layout quality of the edited scene (QL).
  • ...and 2 more figures
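
The subgraph selection and hierarchical queue traversal described in the Figure 2 caption can be sketched as follows. This is an illustrative assumption rather than the paper's implementation: the hierarchy is stored as a child-to-parent map, the relevant objects are passed in directly (in ScanEdit they are identified with an LLM), and "hierarchical" ordering is taken to mean that a supporting object is handled before the objects resting on it. The function names and the instruction dictionary layout are hypothetical.

```python
from collections import deque


def relevant_subgraph(parent_of, targets):
    """Collect the target objects together with all of their ancestors."""
    keep = set()
    for obj in targets:
        while obj is not None and obj not in keep:
            keep.add(obj)
            obj = parent_of.get(obj)
    return keep


def order_instructions(parent_of, instructions):
    """Queue per-object instructions so shallower (supporting) objects come first."""
    def depth(obj):
        d = 0
        while parent_of.get(obj) is not None:
            obj = parent_of[obj]
            d += 1
        return d

    # sorted() is stable, so instructions at the same depth keep their input order.
    return deque(sorted(instructions, key=lambda ins: depth(ins["target"])))


# Hypothetical scene loosely following the caption's example.
parent_of = {
    "room": None,
    "bookshelf": "room",
    "table": "room",
    "chair": "room",
    "vase_1": "bookshelf",
    "vase_2": "bookshelf",
}
instructions = [
    {"target": "vase_1", "action": "place on table"},
    {"target": "table", "action": "move next to window"},
    {"target": "vase_2", "action": "place on table"},
]
queue = order_instructions(parent_of, instructions)
subgraph = relevant_subgraph(parent_of, ["vase_1", "vase_2", "table"])
```

In this sketch the table is processed before either vase, so the vases are placed relative to the table's new position, and the unrelated chair is left out of the subgraph.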