SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani

TL;DR

SV4D 2.0 tackles the challenge of turning a monocular video into high‑quality dynamic 3D assets by combining multi‑view video synthesis with 4D optimization in a diffusion framework. It introduces a 3D‑aware network that uses 3D attention, a random masking strategy that removes the dependency on reference views, and a progressive 3D‑to‑4D training schedule, coupled with a two‑stage 4D optimization that uses visibility‑weighted reconstruction losses. Empirical results show consistent gains in detail, spatio‑temporal consistency, and robustness to occlusion across novel‑view video synthesis (NVVS) and 4D generation on synthetic and real data, with strong user study preference. The approach also offers practical benefits, including efficient NVVS inference and no reliance on a separate multi‑view model, making it a strong foundation for high‑fidelity 4D asset creation.

Abstract

We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing the quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gains by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and better 4D optimization results (-12% LPIPS and -24% FV4D), compared to SV4D. Project page: https://sv4d20.github.io.
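
The blending mechanism for 3D and frame attention named in item 1 can be pictured as a learnable weighted sum of two branches. The sketch below is a minimal NumPy illustration, assuming a scalar, sigmoid-gated weight per layer. The names `alpha_blend` and `alpha_logit` are hypothetical, and the exact parameterization used in SV4D 2.0 may differ.

```python
import numpy as np

def alpha_blend(base_out, attn_out, alpha_logit):
    """Blend a base-layer output with an attention-layer output.

    Assumption (not from the paper): alpha is a scalar obtained by
    squashing a learnable logit through a sigmoid, so it stays in (0, 1).
    alpha near 1 keeps mostly the base branch; alpha near 0 keeps mostly
    the attention branch.
    """
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * base_out + (1.0 - alpha) * attn_out

# Toy features standing in for one token's activations.
base = np.ones(4)    # hypothetical output of the preceding layer
attn = np.zeros(4)   # hypothetical output of a 3D or frame attention layer

blended = alpha_blend(base, attn, alpha_logit=0.0)  # equal mix of both branches
```

Under this assumption, a large positive logit leaves the base branch almost unchanged, which is one way a blend of this kind could preserve the priors of a pretrained model while the attention branch is being trained.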

Paper Structure

This paper contains 30 sections, 2 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: SV4D 2.0 generates multiple novel-view videos from an input monocular video. The generated novel-view videos have high fidelity in terms of detail sharpness and consistency across view and time/frame axes, which can be used to optimize high-quality 4D assets.
  • Figure 2: Stable Video 4D 2.0 (SV4D 2.0) network architecture. Our model is similar to SV4D (Xie et al., 2024) with several key differences: 1) we randomly mask the reference multi-view latents for cross-attention conditioning, allowing the model to generate multi-view videos without depending on a separate multi-view diffusion model; 2) we replace view attention with 3D attention and condition it on the camera poses relative to the input view, making it more robust to arbitrary and sparse novel views; 3) we design an $\alpha$-blending strategy for both 3D and frame attention layers to merge spatial and temporal information effectively while preserving priors from multi-view and video diffusion models, which also enables joint training on 3D and 4D data.
  • Figure 3: 4D optimization overview. In the first stage, we use the initial synthesized multi-view videos as pseudo ground-truths to optimize a dynamic NeRF. To handle the 3D inconsistency and pose misalignment in novel view synthesis, we propose a second-stage refinement that noises and denoises the renders of the dynamic NeRF to produce enhanced (3D-consistent) photogrammetry targets. We also propose a visibility weighting scheme for the reconstruction losses to mitigate inconsistent texture across views.
  • Figure 4: Detailed analyses of our 4D optimization strategies. (a) Our Stage-2 refinement can effectively reduce the artifacts in dynamic NeRF caused by inconsistent novel-view video synthesis (NVVS). (b) We compute soft visibility maps as view-dependent loss weights based on surface normal estimates to further mitigate texture inconsistency. (c) The proposed progressive frame and orthogonal view sampling are shown to facilitate the learning of temporal deformation and capture better details in motion.
  • Figure 5: Visual Comparison of Novel View Video Synthesis Results. We show two frames in the input videos and two novel-view results of the corresponding frames. Compared to the baseline methods, SV4D 2.0 outputs contain geometry and texture details that are more faithful to the input video and consistent across frames. We also refer reviewers to the Supplemental Material for the video comparison.
  • ...and 10 more figures
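
The visibility weighting described in the Figure 3 and Figure 4(b) captions can be illustrated with a small NumPy sketch. It assumes the soft visibility of a pixel is the clamped cosine between its estimated surface normal and the direction toward the camera, and that this map scales a per-pixel squared-error loss. The functions `soft_visibility` and `weighted_mse` are hypothetical names, and the formula is an assumption rather than the paper's exact definition.

```python
import numpy as np

def soft_visibility(normals, to_camera):
    """Per-pixel soft visibility in [0, 1].

    normals:   (H, W, 3) unit surface normals (estimated).
    to_camera: (3,) unit vector pointing from the surface toward the camera.
    Assumption: visibility is the cosine of the angle between the two,
    clamped so that back-facing pixels receive zero weight.
    """
    cos = np.einsum("hwc,c->hw", normals, to_camera)
    return np.clip(cos, 0.0, 1.0)

def weighted_mse(render, target, weights, eps=1e-8):
    """Squared error averaged over channels, then weighted over pixels."""
    per_pixel = ((render - target) ** 2).mean(axis=-1)
    return float((weights * per_pixel).sum() / (weights.sum() + eps))

# Toy 2x2 example: three pixels face the camera, one faces away.
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
normals[0, 0] = [0.0, 0.0, -1.0]
weights = soft_visibility(normals, np.array([0.0, 0.0, 1.0]))

render = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
loss = weighted_mse(render, target, weights)
```

With weights of this kind, pixels seen at grazing angles or from behind contribute little to the reconstruction loss, so a texture that is inconsistent in a poorly observed view has less influence on the optimized asset.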