PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing
Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Chen Chen, Jing Hua
TL;DR
PSF-4D tackles 4D scene editing with a diffusion-based framework that enforces temporal and multi-view coherence by progressively controlling the noise used in forward diffusion. It introduces an autoregressive temporal noise model (ANM), a cross-view noise model (CNM), view-consistent refinement (VCR), and view-aware position encoding (VPE) to unify edits across frames and views. The method operates end-to-end without external models and is reported to outperform prior 4D editing methods, both qualitatively and quantitatively, on monocular and multi-view 4D datasets, across tasks such as style transfer, object removal, and multi-attribute editing. The work suggests that principled noise design within diffusion models can drive consistent 4D editing, with practical implications for dynamic scene manipulation in graphics and vision applications.
Abstract
Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scenes, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, so that the noise of each frame depends on that of prior frames. To ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance the commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, and local editing), we show the effectiveness of our proposed method. Experimental results demonstrate that it outperforms state-of-the-art 4D editing methods on diverse benchmarks.
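
The two noise structures described in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation: the first-order autoregressive form, the mixing coefficients `alpha` and `beta`, the flattened one-dimensional noise maps, and all function names are assumptions chosen for illustration. The sketch only shows how correlated Gaussian noise can link consecutive frames, and how a shared component plus a view-specific component can correlate views, while keeping each sample at unit variance.

```python
import math
import random


def temporal_ar_noise(num_frames, size, alpha, rng):
    """Hypothetical AR(1) noise: each frame mixes the previous frame's noise with fresh noise."""
    frames = [[rng.gauss(0.0, 1.0) for _ in range(size)]]
    scale = math.sqrt(1.0 - alpha ** 2)  # keeps every frame at unit variance
    for _ in range(1, num_frames):
        prev = frames[-1]
        frames.append([alpha * p + scale * rng.gauss(0.0, 1.0) for p in prev])
    return frames


def cross_view_noise(num_views, size, beta, rng):
    """Hypothetical mix: sqrt(beta) * shared noise + sqrt(1 - beta) * view-specific noise."""
    shared = [rng.gauss(0.0, 1.0) for _ in range(size)]
    a, b = math.sqrt(beta), math.sqrt(1.0 - beta)
    return [[a * s + b * rng.gauss(0.0, 1.0) for s in shared] for _ in range(num_views)]


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)  # fixed seed so the run is reproducible
frames = temporal_ar_noise(num_frames=4, size=20000, alpha=0.8, rng=rng)
views = cross_view_noise(num_views=3, size=20000, beta=0.5, rng=rng)

# Empirical correlations between adjacent frames and between two views.
print(round(correlation(frames[0], frames[1]), 2))
print(round(correlation(views[0], views[1]), 2))
```

Under these assumptions, the correlation between adjacent frames is approximately `alpha`, it decays geometrically with temporal distance, and the correlation between any two views is approximately `beta`. Raising either coefficient trades per-frame or per-view diversity for stronger consistency.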
