Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer

Dong In Lee, Hyungjun Doh, Seunggeun Chi, Runlin Duan, Sangpil Kim, Karthik Ramani

TL;DR

Dynamic-eDiTor is a training-free framework for text-driven editing of pre-trained 4D Gaussian Splatting (4DGS) models, built on a grid-based spatio-temporal propagation scheme. It introduces Spatio-Temporal Sub-Grid Attention (STGA) for localized cross-view and temporal fusion and Context Token Propagation (CTP) for global propagation via token inheritance and flow-guided replacement, enabling globally consistent edits without per-scene finetuning. The method directly optimizes the 4DGS after editing the 1 FPS frames and demonstrates superior editing fidelity and spatio-temporal coherence on the DyNeRF dataset, outperforming state-of-the-art baselines in semantic alignment and motion stability. Its main limitation is reduced efficacy on drastic geometric or topological edits, which the authors leave to future work beyond propagation-based editing.

Abstract

Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored because edits must stay consistent both across views and over time. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. The framework consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video editing without additional training and to directly optimize the pre-trained source 4DGS. Extensive experiments on the multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and better multi-view and temporal consistency than prior approaches. Project page for results and code: https://di-lee.github.io/dynamic-eDiTor/
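To make the idea of local cross-view and temporal fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes the multi-view video is arranged as a (view, time) grid of per-frame token sequences, splits that grid into small non-overlapping sub-grids, and lets all tokens inside one sub-grid attend to each other. The function names, the 2x2 sub-grid size, and the use of plain attention without learned query/key/value projections are all illustrative assumptions.

```python
import numpy as np

def partition_subgrids(num_views, num_frames, sub_v, sub_t):
    """Split a (view, time) grid into non-overlapping sub-grids.

    Hypothetical helper: each sub-grid holds at most sub_v x sub_t cells.
    """
    groups = []
    for v0 in range(0, num_views, sub_v):
        for t0 in range(0, num_frames, sub_t):
            cells = [(v, t)
                     for v in range(v0, min(v0 + sub_v, num_views))
                     for t in range(t0, min(t0 + sub_t, num_frames))]
            groups.append(cells)
    return groups

def joint_attention(q, k, v):
    """Plain scaled dot-product attention over an (N, d) token sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def subgrid_attention(tokens, sub_v=2, sub_t=2):
    """tokens: float array of shape (V, T, N, d).

    Assumption: tokens serve directly as queries, keys and values; a real
    transformer layer would apply learned projections first.
    """
    V, T, N, d = tokens.shape
    out = np.empty_like(tokens)
    for cells in partition_subgrids(V, T, sub_v, sub_t):
        # Concatenate the tokens of every frame in this sub-grid.
        x = np.concatenate([tokens[v, t] for v, t in cells], axis=0)
        y = joint_attention(x, x, x)
        # Scatter the fused tokens back to their frames.
        for i, (v, t) in enumerate(cells):
            out[v, t] = y[i * N:(i + 1) * N]
    return out
```

In this sketch each frame only sees the frames that share its sub-grid, which is why a separate global step (CTP in the paper) is needed to carry information across sub-grids.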

Paper Structure

This paper contains 38 sections, 13 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: Dynamic-eDiTor enables flexible and high-quality editing of pre-trained 4D Gaussian Splatting [wu20244d] models by leveraging a Multimodal Diffusion Transformer [esser2024scaling, wu2025qwen], guided solely by text instructions. Through its design focused on both multi-view and temporal consistency, our approach demonstrates robust performance, producing realistic and fine-grained 4D scene manipulation.
  • Figure 2: Dynamic-eDiTor Overview. We represent the multi-view video as a unified camera–time grid. Dynamic-eDiTor combines Spatio-Temporal Sub-Grid Attention (STGA), which performs locally coherent cross-view and temporal fusion within each sub-grid, with Context Token Propagation (CTP), which globally propagates the aggregated features across the grid via Full Token Inheritance and Flow-guided Token Replacement for robust spatio-temporal consistency enforcement. Together, these modules enable seamless, globally consistent multi-view video editing without additional training, while directly optimizing the pre-trained 4DGS.
  • Figure 3: Vital Layer Range Analysis. We analyze the impact of applying Spatio-Temporal Sub-Grid Attention (STGA) across different layer ranges in MM-DiT [wu2025qwen, esser2024scaling] during the multi-view video editing process. Performance is evaluated by temporal consistency (Warping Error [lai2018learning]), multi-view consistency (MEt3R [asim2025met3r]), and editing fidelity (CLIP Text-Image Directional Similarity [radford2021learning]). Applying STGA to the early $\sim$30 layers provides the best trade-off between consistency and editing fidelity.
  • Figure 4: Qualitative Comparison. Dynamic-eDiTor enables more robust non-rigid content manipulation and achieves more complete edits of the 4D scene. The top row displays the original rendered frames, while the following rows show the edited 4DGS renderings produced by each baseline. Our method (bottom row) outperforms all baselines in both text alignment and overall editing fidelity, while maintaining strong temporal and spatial consistency.
  • Figure 5: Qualitative Ablation Results. The model lacking both components (top-left) suffers from severe artifacts and geometric drift. Adding only STGA or only CTP progressively improves the result, but still leaves residual motion blur and geometric drift. Our full method (bottom-right) successfully enforces spatio-temporal consistency, producing a stable and complete edit.
  • ...and 8 more figures
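
The editing-fidelity metric named in the Figure 3 caption, CLIP text-image directional similarity, is commonly defined as the cosine similarity between the change in image embedding (original to edited) and the change in text embedding (source prompt to target prompt). The sketch below assumes the four CLIP embeddings have already been computed elsewhere; the function name and argument names are hypothetical, and the paper's exact evaluation protocol (CLIP variant, averaging over frames and views) is not specified in this section.

```python
import numpy as np

def directional_similarity(img_src, img_edit, txt_src, txt_tgt, eps=1e-8):
    """Cosine similarity between the image-space and text-space edit directions.

    All inputs are 1-D embedding vectors of equal length (assumed to be
    precomputed CLIP features; no model is loaded here).
    """
    d_img = np.asarray(img_edit, dtype=float) - np.asarray(img_src, dtype=float)
    d_txt = np.asarray(txt_tgt, dtype=float) - np.asarray(txt_src, dtype=float)
    denom = np.linalg.norm(d_img) * np.linalg.norm(d_txt)
    # eps guards against a zero-length direction (e.g. an unchanged image).
    return float(d_img @ d_txt / max(denom, eps))
```

A value near 1 means the image changed in the direction the prompt change asks for; a value near 0 means the visual change is unrelated to the requested edit.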