TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

Jiyuan Hu, Zechuan Zhang, Zongxin Yang, Yi Yang

Abstract

We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset, the first multi-view consistent dataset dedicated to scene-coherent object addition and modification, to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.
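
The abstract only describes CVM at a high level, but its core geometric step, projecting the anchored mesh into each frame's camera to obtain an editing mask, is easy to illustrate. Below is a minimal numpy sketch under simplifying assumptions (a pinhole camera, all vertices in front of the camera, no occlusion or depth testing); the function and parameter names are hypothetical, not the authors' API.

```python
import numpy as np

def project_mesh_mask(vertices, faces, K, w2c, height, width):
    """Rasterize a binary silhouette mask of a mesh under a pinhole camera.

    vertices: (N, 3) world-space mesh vertices
    faces:    (F, 3) integer vertex indices
    K:        (3, 3) intrinsics; w2c: (4, 4) world-to-camera extrinsics
    """
    # Transform to camera space, then apply the pinhole projection.
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])
    v_cam = (w2c @ v_h.T).T[:, :3]
    uv = (K @ v_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]  # perspective divide (assumes z > 0)

    mask = np.zeros((height, width), dtype=bool)
    for tri in uv[faces]:  # fill each projected triangle
        lo = np.clip(np.floor(tri.min(axis=0)).astype(int), 0, [width - 1, height - 1])
        hi = np.clip(np.ceil(tri.max(axis=0)).astype(int), 0, [width - 1, height - 1])
        xs, ys = np.meshgrid(np.arange(lo[0], hi[0] + 1), np.arange(lo[1], hi[1] + 1))
        pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
        # Barycentric inside-test for every pixel in the triangle's bounding box.
        a, b, c = tri
        d = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
        if abs(d) < 1e-9:
            continue  # degenerate (edge-on) triangle
        w0 = ((b[1] - c[1]) * (pts[:, 0] - c[0]) + (c[0] - b[0]) * (pts[:, 1] - c[1])) / d
        w1 = ((c[1] - a[1]) * (pts[:, 0] - c[0]) + (a[0] - c[0]) * (pts[:, 1] - c[1])) / d
        inside = (w0 >= 0) & (w1 >= 0) & (w0 + w1 <= 1)
        mask[pts[inside, 1].astype(int), pts[inside, 0].astype(int)] = True
    return mask
```

Running this per frame along a camera trajectory yields a geometry-aligned mask sequence that a video inpainting or diffusion model can condition on; the actual CVM stage presumably uses a proper rasterizer with depth testing.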

Figures (9)

  • Figure 1: Representative showcases of TRACE. Top: Interactive edits on the "Stone Horse Statue" scene, demonstrating part-level geometric manipulation. Bottom: Diverse editing examples including object addition, texture synthesis, and style transfer.
  • Figure 2: Limitations of existing 3D editing approaches. (a) Geometric Instability: Methods with weak 3D grounding struggle with multi-view consistency (left) and fail at structural modifications (right). (b) Inflexibility and Misalignment: Explicit geometry approaches suffer from texture and lighting mismatches (left) and generate inflexible 3D priors that misalign with specific tasks (right).
  • Figure 3: Method overview. Our pipeline starts with (a) Multi-view 3D-Anchor Synthesis, which generates geometrically aligned reference views through 3D-LoRA and VLM guidance. These views are fed into (b) Tangible Geometry Alignment (TGA), where LRM-generated meshes are anchored in the scene via a two-stage alignment module. Subsequently, (c) Contextual Video Masking (CVM) propagates these edits across continuous camera trajectories with context-aware masking. The final 3D scene is then reconstructed and optimized from the complete set of edited videos (d).
  • Figure 4: Overview of the MV-TRACE dataset curation pipeline. (Top) The curation pipeline begins with 3D asset creation/retrieval and human-assisted spatial alignment to enhance the 3D consistency of editing models and mitigate editing artifacts. We then perform dense spherical view sampling (96 views per scene; see the sampling sketch after this figure list) and filter view pairs to remove occluded frames or those without the edited object. A final color & contact refinement stage improves visual integration. (Bottom) A qualitative gallery of dataset samples. Results are shown before the color refinement stage, so color compatibility may still be imperfect, while geometric consistency is preserved.
  • Figure 5: Alignment pipeline. To edit the scene by adding sunglasses to the man's face, the initial 3D asset is generated without orientation, resulting in severe misalignment and a reversed heading (see "Before Alignment"). Directly optimizing from such a state would converge to a local optimum in which the asset faces backwards. To resolve this, we employ the two-stage strategy described in the two-stage alignment section (see also the alignment sketch after this figure list). The bottom row demonstrates the progressive transition from a disjointed state to an accurately aligned result that is highly consistent with the reference views.
  • ...and 4 more figures
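
Figure 4 mentions dense spherical view sampling with 96 views per scene, but the exact placement scheme is not specified. A common choice, used here purely as an illustrative assumption, is a Fibonacci lattice on the sphere with look-at camera poses; the names below (sample_spherical_cameras, etc.) are hypothetical.

```python
import numpy as np

def sample_spherical_cameras(n_views=96, radius=2.5, target=np.zeros(3)):
    """Place n_views cameras on a Fibonacci sphere, all looking at `target`.

    Returns a list of (4, 4) camera-to-world matrices (OpenGL convention:
    each camera looks down its local -z axis).
    """
    golden = np.pi * (3.0 - np.sqrt(5.0))        # golden angle: even coverage
    poses = []
    for i in range(n_views):
        z = 1.0 - 2.0 * (i + 0.5) / n_views      # uniform in height
        r = np.sqrt(max(0.0, 1.0 - z * z))
        theta = golden * i
        eye = target + radius * np.array([r * np.cos(theta), r * np.sin(theta), z])
        # Build a look-at frame with world-up = +z.
        fwd = (target - eye) / np.linalg.norm(target - eye)
        right = np.cross(fwd, np.array([0.0, 0.0, 1.0]))
        if np.linalg.norm(right) < 1e-6:          # camera directly at a pole
            right = np.array([1.0, 0.0, 0.0])
        right = right / np.linalg.norm(right)
        up = np.cross(right, fwd)
        c2w = np.eye(4)
        c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -fwd, eye
        poses.append(c2w)
    return poses
```

The occlusion and visibility filtering described in the caption would then run on renders from these poses, discarding views where the edited object is hidden or absent.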
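
Figure 5's two-stage alignment is likewise described only qualitatively. The sketch below captures the underlying idea under stated assumptions: a coarse exhaustive search over discrete headings rules out the reversed-orientation local optimum before any gradient-based refinement starts, and the fine stage here refines only translation by finite differences for brevity (the actual module presumably optimizes a full similarity transform against the reference views). All names and the chamfer-style loss are illustrative, not the authors' objective.

```python
import numpy as np

def yaw_matrix(deg):
    """Rotation about the world z axis by `deg` degrees."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def align_two_stage(src_pts, loss_fn, step=1e-2, iters=200):
    """Two-stage rigid alignment sketch: coarse heading search, then
    finite-difference descent on translation."""
    # Stage 1: coarse orientation search over discrete headings, so the
    # asset cannot settle into a backwards-facing local optimum.
    best_R, best_loss = np.eye(3), np.inf
    for deg in range(0, 360, 45):
        R = yaw_matrix(deg)
        l = loss_fn(src_pts @ R.T)
        if l < best_loss:
            best_R, best_loss = R, l
    # Stage 2: fine translation refinement from the coarse pose.
    t = np.zeros(3)
    for _ in range(iters):
        grad = np.zeros(3)
        for k in range(3):
            dt = np.zeros(3)
            dt[k] = 1e-4
            f_plus = loss_fn(src_pts @ best_R.T + t + dt)
            f_minus = loss_fn(src_pts @ best_R.T + t - dt)
            grad[k] = (f_plus - f_minus) / 2e-4
        t -= step * grad
    return best_R, t

# Example: align random points to a target cloud with a one-sided chamfer loss.
target = np.random.rand(500, 3)
chamfer = lambda p: np.mean(np.min(np.linalg.norm(p[:, None] - target[None], axis=-1), axis=1))
R, t = align_two_stage(np.random.rand(500, 3), chamfer)
```

Because stage 1 evaluates every 45° heading before committing, the backwards-facing configuration that Figure 5 warns about is rejected outright rather than escaped during descent.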