
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer

Meng You, Zhiyu Zhu, Hui Liu, Junhui Hou

TL;DR

NVS-Solver presents a training-free approach to novel view synthesis by steering pre-trained video diffusion models with scene priors derived from warped input views. It introduces adaptive score modulation through a theoretically grounded, error-bound-based lambda to balance diffusion guidance and view-consistency, enabling high-fidelity NVS from single, multi-view, or monocular video inputs. Extensive experiments on static and dynamic scenes demonstrate state-of-the-art performance in both visual quality and pose accuracy, with ablations validating the core components. The method broadens zero-shot capabilities in view synthesis and suggests avenues for integrating diffusion-based generative models with explicit camera-geometry constraints.

Abstract

By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS) paradigm that operates without the need for training. NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the significant superiority of our NVS-Solver over state-of-the-art methods both quantitatively and qualitatively. Source code: https://github.com/ZHU-Zhiyu/NVS_Solver
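
The adaptive modulation described in the abstract can be illustrated with a toy sketch. This is not the authors' formulation: it assumes, purely for illustration, that the diffusion estimation error grows linearly with the noise level `sigma_t`, that the warping error grows linearly with the norm of the pose change, and that the two estimates are combined by a simple inverse-error blend. The function names, the constants `a` and `b`, and the flat-list image representation are all hypothetical.

```python
import math

def blend_weight(sigma_t, pose_delta, a=1.0, b=1.0):
    """Hypothetical weight in [0, 1] placed on the warped-view prior.

    Assumed error models (illustrative only, not taken from the paper):
      diffusion estimation error ~ a * sigma_t
      warping error              ~ b * ||pose_delta||_2
    The prior is trusted more when its own error is small relative to
    the diffusion error.
    """
    err_diffusion = a * sigma_t
    err_warp = b * math.sqrt(sum(d * d for d in pose_delta))
    total = err_diffusion + err_warp
    if total == 0.0:
        return 0.5  # both assumed errors vanish: no preference
    return err_diffusion / total

def modulate(denoised, warped, visible, sigma_t, pose_delta):
    """Blend the diffusion model's denoised estimate with the warped view.

    denoised, warped: flat lists of pixel values of equal length.
    visible: flat list of booleans; False marks holes left by warping,
             where only the diffusion estimate is available.
    """
    w = blend_weight(sigma_t, pose_delta)
    return [
        w * p + (1.0 - w) * x if v else x
        for x, p, v in zip(denoised, warped, visible)
    ]

# Early step (high noise), small pose change: the warped prior dominates.
print(blend_weight(sigma_t=10.0, pose_delta=[0.1, 0.0, 0.0]))
# Late step (low noise), large pose change: the diffusion estimate dominates.
print(blend_weight(sigma_t=0.1, pose_delta=[2.0, 0.0, 0.0]))
```

Under these assumptions, the weight depends on both the diffusion step (through `sigma_t`) and the target view pose (through `pose_delta`), which mirrors the two factors the abstract names; the actual weighting rule derived in the paper may differ.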

Paper Structure

This paper contains 29 sections, 29 equations, 24 figures, 11 tables, 1 algorithm.

Figures (24)

  • Figure 2: Experimental observations of the relationship (a) between the diffusion estimation error $\mathcal{E}_D$ and the noise level $\sigma_t$, and (b) between the warped-image error $\mathcal{E}_T$ and the magnitude of the view-pose change $\|\Delta\mathbf{p}\|_2$.
  • Figure 3: Visual comparison of single view-based NVS results by (a) Text2Nerf [zhang2024text2nerf], (b) 3D-aware [xiang20233d], (c) MotionCtrl [wang2023motionctrl], and (d) Ours (Post). The middle view of each scene, highlighted with a red rectangle, is the input view. Here, we only show the results of the best two of all compared methods. See the Appendix and the video demo in the supplementary file for more results and comparisons.
  • Figure 4: (a) The two input views of each scene, highlighted with red rectangles. Visual results of multiview-based NVS by (b) 3D-aware [xiang20233d], (c) MotionCtrl [wang2023motionctrl], and (d) Ours (Post).
  • Figure 5: Visual results of synthesized 360° NVS from (a) single-view and (b) multi-view input. The input views of each scene are highlighted with red rectangles.
  • Figure 6: Visual comparison on dynamic scene view synthesis: (a) input frames at the times corresponding to the generated images, (b) Deformable-Gaussian [yang2023deformable3dgs], (c) 4D-Gaussian [wu20234dgaussians], (d) 3D-aware [xiang20233d], (e) MotionCtrl [wang2023motionctrl], and (f) Ours (Post).
  • ...and 19 more figures