GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Minjun Kang; Inkyu Shin; Taeyeop Lee; Myungchul Kim; In So Kweon; Kuk-Jin Yoon

Abstract

Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.

Paper Structure (30 sections, 10 equations, 25 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminary
GS-Adapter: Geometry-Guided Diffusion Feature Adaptation
Integration into Video Diffusion
Experiment
Overview
Implementation Details
Synthesis Quality
Geometric Consistency and Camera Controllability
Feature analysis of GS-Adapter
Ablation Studies
Conclusions
Notation Table
...and 15 more sections

Figures (25)

Figure 1: Geometry-guided generative NVS. (a) Pure diffusion model produces view-inconsistent results. Given sparse input-view images, both (b) and (c) reconstruct 3D Gaussians from input views using a geometry prior. (b) Previous methods inject rasterized novel-view images from 3D-GS as input, causing artifacts from noisy rasterized colors. (c) Our method modulates internal diffusion features via a GS-Adapter conditioned on 3D-GS, achieving superior geometry consistency and visual quality.
Figure 2: GeoNVS architecture. (a) Overview of the integration with a video diffusion model. (b) The GS-Adapter pipeline for feature lifting, refinement, and fusion. All learnable modules are trained with LoRA [hu2022lora]. During training, a consistency loss $\mathcal{L}_{\text{feat}}$ is applied to preserve geometric detail lost during feature lifting. Please refer to the supplementary material for details of the multi-scale fusion module and RefineNet.
Figure 3: Feature fusion module of GS-Adapter. Two fusion approaches are proposed to integrate the diffusion feature $\mathbf{F}_{\mathrm{tar}}$ and the geometry-aware feature $\tilde{\mathbf{G}}_{\mathrm{tar}}$, producing the updated novel-view feature $\hat{\mathbf{F}}_{\mathrm{tar}}$. We adopt adaptive fusion as it remains effective even when either the geometry prior or the generative model fails.
Figure 4: Feature modulation by GS-Adapter. We visualize intermediate diffusion features during the denoising process. GS-Adapter consists of three stages: (1) lifting reference-view features $\mathbf{F}_\text{ref}^t$ into 3D Gaussians, (2) refining the novel-view features $\mathbf{G}_\text{tar}$ into $\hat{\mathbf{G}}_\text{tar}$, and (3) fusing $\hat{\mathbf{G}}_\text{tar}$ with $\mathbf{F}_\text{tar}^t$ to generate geometry-corrected outputs.
Figure 5: Qualitative results of GeoNVS with SEVA [zhou2025stable].
...and 20 more figures
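
The adaptive fusion described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the fusion is a per-element convex blend of the diffusion feature and the geometry-aware feature, controlled by a gate in [0, 1]. The function name `adaptive_fuse`, the `gate_logits` input, and the sigmoid gating are all hypothetical choices made for illustration; in a real model the gate would presumably be predicted by a learned module.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a gate weight in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(f_tar, g_tar, gate_logits):
    """Blend diffusion features with geometry-aware features.

    ASSUMPTION: illustrative only. Each output element is
    w * g + (1 - w) * f with w = sigmoid(logit), so the result can
    fall back to either source when the other is unreliable.

    f_tar:       diffusion features for the novel view (flat list of floats)
    g_tar:       geometry-aware features rendered for the same view
    gate_logits: hypothetical per-element logits controlling the blend
    """
    if not (len(f_tar) == len(g_tar) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for f, g, z in zip(f_tar, g_tar, gate_logits):
        w = sigmoid(z)  # weight given to the geometry-aware feature
        fused.append(w * g + (1.0 - w) * f)
    return fused
```

Under this assumption, a strongly negative logit keeps the diffusion feature (useful if the geometry prior fails), a strongly positive logit keeps the geometry-aware feature (useful if the generative model drifts), and a logit of zero averages the two.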
