PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Chen Chen, Jing Hua

TL;DR

PSF-4D tackles 4D scene editing with a diffusion-based framework that enforces temporal and multi-view coherence by progressively controlling forward noise. It introduces an autoregressive temporal noise model (ANM), a cross-view noise model (CNM), and a view-consistent refinement (VCR) plus view-aware position encoding (VPE) to unify edits across frames and views. The method operates end-to-end without external models, achieving superior qualitative and quantitative results on monocular and multi-view 4D datasets, and succeeds across tasks such as style transfer, object removal, and multi-attribute editing. This work demonstrates that principled noise design within diffusion models can robustly drive 4D editing, with practical implications for consistent dynamic scene manipulation in graphics and vision applications.

Abstract

Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scenes, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by intuitively controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, and local editing), we show the effectiveness of our proposed method. Experimental results demonstrate that our proposed method outperforms state-of-the-art 4D editing methods on diverse benchmarks.
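
The two noise ideas in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the first-order autoregressive recurrence, the square-root mixing weights, and the names `alpha`, `beta`, `autoregressive_noise`, `cross_view_noise`, and `correlation` are all assumptions chosen for illustration. The sketch only shows how Gaussian noise can be correlated across frames and then split into a component shared by all views and a component private to each view, while every sample keeps unit variance.

```python
import math
import random


def autoregressive_noise(num_frames, size, alpha=0.8, seed=0):
    """Return num_frames noise vectors of length `size`, correlated over time.

    Assumed AR(1) form: eps_t = alpha * eps_{t-1} + sqrt(1 - alpha^2) * xi_t,
    where xi_t is fresh standard Gaussian noise.
    """
    rng = random.Random(seed)
    fresh_scale = math.sqrt(1.0 - alpha ** 2)  # keeps each frame at unit variance
    frames = [[rng.gauss(0.0, 1.0) for _ in range(size)]]
    for _ in range(1, num_frames):
        prev = frames[-1]
        frames.append([alpha * p + fresh_scale * rng.gauss(0.0, 1.0) for p in prev])
    return frames


def cross_view_noise(temporal_frames, num_views, beta=0.5, seed=1):
    """Mix the shared temporal noise with independent per-view noise.

    Assumed form: eps_v = sqrt(beta) * shared + sqrt(1 - beta) * independent_v.
    The result is indexed as [view][frame][element].
    """
    rng = random.Random(seed)
    w_shared, w_indep = math.sqrt(beta), math.sqrt(1.0 - beta)
    return [
        [[w_shared * s + w_indep * rng.gauss(0.0, 1.0) for s in frame]
         for frame in temporal_frames]
        for _ in range(num_views)
    ]


def correlation(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Under these assumptions, adjacent frames have correlation close to `alpha` and the same frame seen from two different views has correlation close to `beta`, so the two parameters trade consistency against per-frame and per-view variation.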

Paper Structure

This paper contains 28 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Examples of 4D editing tasks with our approach, covering Local Editing, Style Transfer, Object Removal, and Multi-Attribute Editing. Local Editing: Transforms objects like a teddy bear into a Tiger Teddy and a cat into a Chow Chow Dog. Style Transfer: Applies artistic styles, such as Fauvism Painting, across frames, maintaining visual coherence. Object Removal: Eliminates objects (e.g., big lego blocks, dog) while preserving background consistency. Multi-Attribute Editing: Combines edits, such as changing hair to blue and clothing to silver, or adding attributes like "Blue jeans, black seat, and green walls." These examples demonstrate our model’s ability to perform complex 4D edits with spatial and temporal consistency across various scenarios.
  • Figure 2: The PSF-4D framework for text-guided 4D editing. Right: We introduce a progressive noise sampling method for noise initialization, consisting of two key stages: (i) an autoregressive noise model to ensure temporal consistency and (ii) cross-view noise control to maintain spatial coherence. Left: This technique is incorporated into the diffusion process of the text-to-video (T2V) editing model, enabling the generation of 4D scenes with spatio-temporal coherence across multiple views. We further refine the edited 4D scene with a view-consistent refinement strategy. Note that this refinement is applied only after the initial edited 4D model has been constructed, i.e., for $l \geq 1$ (see the view-consistency section). After the initialization step ($l = 0$), the original rendered latents are no longer used; only the edited rendered latents play a role in the subsequent stages ($l \geq 1$).
  • Figure 3: Qualitative 4D Editing Results. Examples of 4D editing tasks performed using our PSF-4D framework. Each row represents a specific editing scenario, demonstrating the versatility and precision of PSF-4D across a variety of tasks, including style transfer, object transformation, and attribute modification. From transforming a scene into different artistic styles (e.g., "Make it a Fauvism painting," "Make him Van Gogh style") to specific object edits (e.g., "Give him blue trousers and brown leather bag"), PSF-4D maintains consistency and coherence across frames in dynamic 4D scenes.
  • Figure 4: Object Removal in 4D Scenes. Examples of object removal across various scenes from different datasets, including DyNeRF, DyCheck, and HyperNeRF. Each row illustrates the original 4D scene followed by frames with specific objects removed, as per the editing prompt. Prompts such as "Delete Person," "Delete dog," "Remove paper windmill and flowerpot," "Remove red egg," "Remove cookie," and "Remove napkin" demonstrate the capability of our method to accurately and seamlessly edit out targeted objects while preserving the surrounding scene consistency.
  • Figure 5: Comparison of 4D Editing Results. Examples of 4D scene editing using our approach compared to Instruct 4D-to-4D (I4D-to-4D; Mou et al., 2024) and the original 4D scene across various scenes in the DyNeRF dataset. Each column presents a different editing prompt: DyNeRF_Cut_Beef ("Give his clothes blue color"), DyNeRF_Coffee_Martini ("Turn him into bronze statue"), DyNeRF_Flame_Salmon ("Turn him into Van Gogh’s Painting"), and DyNeRF_Cook_Spinach ("Turn him into silver statue"). The original 4D scenes (top row) show unedited content, while the I4D-to-4D results (middle row) illustrate partial modifications. Our approach (bottom row) achieves more precise and consistent adherence to the editing prompts across all frames, producing visually coherent and realistic transformations.
  • ...and 2 more figures