Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang

Abstract

Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and a free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.

Paper Structure (77 sections, 4 equations, 24 figures, 9 tables, 1 algorithm)

Figures (24)

  • Figure 1: Overview of Edit-As-Act. Step 1: an LLM converts a source scene $S_0$ and instruction into symbolic goal predicates $G_T$ in EditLang. Step 2: a planner–validator loop iteratively selects EditLang actions $a_t$ that satisfy goals $G_t$ and regresses remaining goals until all are grounded in $S_0$. Step 3: the resulting action sequence is applied to $S_0$ to obtain the edited scene $S_T$.
  • Figure 2: Representative qualitative results. Baseline methods often introduce unintended global changes, fail to satisfy multi-step instructions, or generate incomplete edits. Edit-As-Act produces precise, instruction-aligned modifications that remain physically valid and preserve the overall scene identity.
  • Figure 3: Effect of removing EditLang, which provides explicit preconditions that allow the chair to be rotated around the table.
  • Figure 4: User study results. Ten participants rated edited scenes produced by Edit-As-Act, ArtiScene, and AnyHome on three criteria. Edit-As-Act obtains the highest perceived instruction fidelity, semantic consistency, and physical plausibility.
  • Figure 5: LLMs demonstrate strong capabilities in interpreting existing 3D layouts (top) but remain unreliable when directly generating 3D layouts from instructions (bottom). This asymmetry motivates our goal-regressive formulation.
  • ...and 19 more figures
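
The planner loop described in the Figure 1 caption (pick an action that satisfies open goals, regress the remaining goals, stop once all are grounded in $S_0$) can be illustrated with classical STRIPS-style goal regression. The sketch below is not the paper's EditLang or its LLM-driven planner and validator: the `Action` fields, the tuple encoding of predicates, the depth-first search, and the toy lamp/desk scene are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action; EditLang's real schema may differ."""
    name: str
    pre: frozenset      # predicates that must hold before the action
    add: frozenset      # predicates the action makes true
    delete: frozenset   # predicates the action makes false


def regress(goals, action):
    """Goals that must hold before `action`, or None if the action is not usable."""
    if not (action.add & goals):   # must achieve at least one open goal
        return None
    if action.delete & goals:      # must not destroy another open goal
        return None
    return (goals - action.add) | action.pre


def plan_by_regression(s0, goals, actions, max_depth=8):
    """Backward depth-first search from the goals until they are grounded in s0."""
    def search(open_goals, suffix, depth):
        if open_goals <= s0:       # every remaining goal already holds in the source scene
            return suffix
        if depth == 0:
            return None
        for a in actions:
            prev = regress(open_goals, a)
            if prev is not None:
                found = search(prev, [a] + suffix, depth - 1)
                if found is not None:
                    return found
        return None
    return search(frozenset(goals), [], max_depth)


def apply_plan(s0, plan):
    """Execute the plan forward on s0, checking preconditions at each step."""
    state = set(s0)
    for a in plan:
        assert a.pre <= state, f"precondition of {a.name} not met"
        state = (state - a.delete) | a.add
    return state


# Toy scene (assumed): the lamp is on the floor and a book occupies the desk.
s0 = frozenset({("on", "lamp", "floor"), ("on", "book", "desk")})
goals = {("on", "lamp", "desk")}
actions = [
    Action("remove_book",
           pre=frozenset({("on", "book", "desk")}),
           add=frozenset({("clear", "desk"), ("on", "book", "shelf")}),
           delete=frozenset({("on", "book", "desk")})),
    Action("move_lamp",
           pre=frozenset({("on", "lamp", "floor"), ("clear", "desk")}),
           add=frozenset({("on", "lamp", "desk")}),
           delete=frozenset({("on", "lamp", "floor"), ("clear", "desk")})),
]

plan = plan_by_regression(s0, goals, actions)
final = apply_plan(s0, plan)
print([a.name for a in plan])
```

In this toy case the search first selects `move_lamp` for the goal, which leaves the open precondition `("clear", "desk")`; regressing through `remove_book` grounds everything in the source scene, so the forward plan clears the desk before moving the lamp. Only predicates touched by the two actions change, which mirrors the minimal-edit intent stated in the abstract.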