VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

Hongbo Lu, Liang Yao, Chenghao He, Fan Liu, Wenlong Liao, Tao He, Pai Peng

Abstract

A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
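
To make the "Virtual-Shift" idea concrete, the following is a minimal sketch of one plausible way to derive an occlusion mask on the original view from a depth map. It is not the authors' implementation: the function name `virtual_shift_mask`, the restriction to a purely lateral camera shift, the nearest-pixel z-buffer, and the `tol` threshold are all assumptions made for illustration.

```python
import numpy as np

def virtual_shift_mask(depth, fx, shift_x, tol=0.05):
    """Mark original-view pixels that would be hidden after a lateral camera shift.

    Hypothetical helper, not from the paper. Assumes strictly positive depth,
    a pinhole camera with focal length fx (pixels), and a pure x-translation
    of shift_x (same unit as depth).
    """
    h, w = depth.shape
    rows, cols = np.indices((h, w))
    # For a pure x-translation depth is unchanged and each pixel moves
    # horizontally by its disparity fx * shift_x / depth.
    cols_new = np.rint(cols - fx * shift_x / depth).astype(int)
    inside = (cols_new >= 0) & (cols_new < w)
    # Z-buffer in the shifted view: keep the nearest depth landing on each pixel.
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (rows[inside], cols_new[inside]), depth[inside])
    # Pixels leaving the shifted frustum are treated as masked too.
    mask = np.ones((h, w), dtype=bool)
    # A pixel is occluded if something clearly nearer lands on the same target pixel.
    mask[inside] = depth[inside] > zbuf[rows[inside], cols_new[inside]] * (1.0 + tol)
    return mask

# Toy scene: a far wall with a near object two columns wide.
depth = np.full((4, 8), 10.0)
depth[:, 3:5] = 2.0
mask = virtual_shift_mask(depth, fx=10.0, shift_x=0.4)
```

Under this sketch, the masked pixels lie on the original image grid, so the untouched recorded image can serve directly as the inpainting target.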

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of VisionNVS. Our camera-only framework reformulates novel view synthesis as self-supervised inpainting, trained via Flow Matching with a DiT backbone and the Wan2.2 [wan2025wan] streaming VAE for efficient encoding. As shown in the bottom data pipeline, we construct a "Pseudo Image" condition by masking the raw image based on virtual shift occlusions and filling gaps with neighbor views. The model (top) then learns to recover the original, artifact-free raw image from this degraded condition.
  • Figure 2: Comparison of computational efficiency. We profile memory usage and inference time under different input lengths $N$.
  • Figure 3: Qualitative comparison of temporal consistency. We visualize consecutive frames ($t-2$ to $t+2$) synthesized by DiST-4D [guo2025dist] and Ours. The orange and red boxes highlight specific regions of interest. DiST-4D exhibits severe temporal instability, with building structures distorting (top) and tree details flickering or vanishing (bottom) across frames. In contrast, Ours maintains remarkable geometric stability and visual coherence, preserving object details without artifacts.
  • Figure 4: Visual comparison with state-of-the-art DiST-4D [guo2025dist]. Despite DiST-4D utilizing expensive LiDAR priors, dense metric depth supervision, and complex Cycle Consistency (SCC) strategies, our camera-only VisionNVS achieves comparable visual fidelity and geometric consistency without employing any of these additional tricks.
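
The Figure 1 caption states that the model is trained via Flow Matching to recover the raw image from the degraded "Pseudo Image" condition. As a rough illustration, the sketch below computes a conditional flow matching loss using the common linear (rectified-flow) path between noise and data. The exact parameterization used in the paper is not given here, so the path definition, the function name `flow_matching_loss`, and the `velocity_fn(x_t, t, cond)` signature are assumptions.

```python
import numpy as np

def flow_matching_loss(velocity_fn, clean_latent, cond_latent, rng):
    """One-sample flow matching loss on a linear noise-to-data path.

    Hypothetical sketch: clean_latent stands for the encoded raw image,
    cond_latent for the encoded pseudo image, and velocity_fn for the
    network (a DiT in the paper) called as velocity_fn(x_t, t, cond).
    """
    noise = rng.standard_normal(clean_latent.shape)
    t = rng.uniform()                      # time sampled in [0, 1)
    # Linear interpolation: pure noise at t = 0, clean latent at t = 1.
    x_t = (1.0 - t) * noise + t * clean_latent
    # Along this path the velocity is constant and equals data minus noise.
    target = clean_latent - noise
    pred = velocity_fn(x_t, t, cond_latent)
    return float(np.mean((pred - target) ** 2))

# Usage with a placeholder predictor that always outputs zeros.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 8, 8))
cond = np.zeros_like(clean)
loss = flow_matching_loss(lambda x, t, c: np.zeros_like(x), clean, cond, rng)
```

Because the regression target depends only on the recorded image and the sampled noise, no ground truth at the shifted pose is needed, which matches the self-supervised framing of the abstract.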