GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon, Kuk-Jin Yoon

Abstract

Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and the adapter can be applied to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, respectively, with up to a 2× reduction in translation error and a 7× reduction in Chamfer Distance.
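
For readers unfamiliar with the Chamfer Distance metric cited above, the following is a minimal sketch of one common definition. The abstract does not state which convention the paper uses, so the choices here (squared Euclidean distances, mean over points, sum of both directions) and the function name `chamfer_distance` are assumptions made for illustration.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Assumed convention: squared distances, averaged per point,
    summed over the two directions. The paper may use another variant.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest neighbour in q for each point of p, and vice versa.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

A lower value means the two point sets lie closer together; identical sets give zero.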

Paper Structure (30 sections, 10 equations, 25 figures, 12 tables, 1 algorithm)

Figures (25)

  • Figure 1: Geometry-guided generative NVS. (a) Pure diffusion model produces view-inconsistent results. Given sparse input-view images, both (b) and (c) reconstruct 3D Gaussians from input views using a geometry prior. (b) Previous methods inject rasterized novel-view images from 3D-GS as input, causing artifacts from noisy rasterized colors. (c) Our method modulates internal diffusion features via a GS-Adapter conditioned on 3D-GS, achieving superior geometry consistency and visual quality.
  • Figure 2: GeoNVS architecture. (a) Overview of the integration with a video diffusion model. (b) The GS-Adapter pipeline for feature lifting, refinement, and fusion. All learnable modules are trained with LoRA (Hu et al., 2022). During training, a consistency loss $\mathcal{L}_{\text{feat}}$ is applied to preserve geometric detail lost during feature lifting. Please refer to the supplementary material for details of the multi-scale fusion module and RefineNet.
  • Figure 3: Feature fusion module of GS-Adapter. Two fusion approaches are proposed to integrate the diffusion feature $\mathbf{F}_{\mathrm{tar}}$ and the geometry-aware feature $\tilde{\mathbf{G}}_{\mathrm{tar}}$, producing the updated novel-view feature $\hat{\mathbf{F}}_{\mathrm{tar}}$. We adopt adaptive fusion as it remains effective even when either the geometry prior or the generative model fails.
  • Figure 4: Feature modulation by GS-Adapter. We visualize intermediate diffusion features during the denoising process. GS-Adapter consists of three stages: (1) lifting reference-view features $\mathbf{F}_\text{ref}^t$ into 3D Gaussians, (2) refining the novel-view features $\mathbf{G}_\text{tar}$ into $\hat{\mathbf{G}}_\text{tar}$, and (3) fusing $\hat{\mathbf{G}}_\text{tar}$ with $\mathbf{F}_\text{tar}^t$ to generate geometry-corrected outputs.
  • Figure 5: Qualitative results of GeoNVS with SEVA (Zhou et al., 2025).
  • ...and 20 more figures
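
The adaptive fusion named in the Figure 3 caption can be illustrated with a small sketch. The captions do not give the fusion equations, so the following is not the authors' module: it assumes a per-pixel sigmoid gate computed from the concatenated features, and `adaptive_fuse`, `w_gate`, and `b_gate` are hypothetical names. It only shows the general idea of blending the diffusion feature $\mathbf{F}_{\mathrm{tar}}$ and the geometry-aware feature $\tilde{\mathbf{G}}_{\mathrm{tar}}$ with a data-dependent weight, so that either source can dominate where the other is unreliable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(f_tar, g_tar, w_gate, b_gate=0.0):
    """Blend a diffusion feature with a geometry-aware feature.

    f_tar, g_tar: arrays of shape (C, H, W).
    w_gate: hypothetical gate weights of shape (2C,), acting like a
            1x1 convolution over the concatenated channels.
    Returns the fused feature (C, H, W) and the gate map (1, H, W).
    """
    stacked = np.concatenate([f_tar, g_tar], axis=0)           # (2C, H, W)
    logits = np.tensordot(w_gate, stacked, axes=([0], [0]))    # (H, W)
    alpha = sigmoid(logits + b_gate)[None]                     # (1, H, W)
    # alpha near 1 trusts the geometry feature, near 0 the diffusion feature.
    fused = alpha * g_tar + (1.0 - alpha) * f_tar
    return fused, alpha
```

With zero gate weights the sketch reduces to a plain average of the two features; in a trained model the gate would instead be learned, which this example does not attempt to reproduce.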