Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus
TL;DR
MVGD addresses generalizable novel view synthesis and depth estimation from sparse posed images with a single diffusion model that generates RGB images and depth maps directly from an arbitrary number of input views, without an intermediate 3D representation. It uses raymap conditioning and scene scale normalization within a Transformer-based Recurrent Interface Network (RIN) architecture, and learnable task embeddings steer the shared diffusion process toward either modality. The approach achieves state-of-the-art results on multiple novel view synthesis benchmarks and excels in multi-view depth estimation (e.g., ScanNet). It also presents an incremental fine-tuning strategy that scales model capacity without retraining from scratch, along with training and conditioning strategies suited to large heterogeneous datasets.
Abstract
Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians to achieve multi-view consistent scene appearance and geometry. In this paper, we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning both to augment visual features with spatial information from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results on multiple novel view synthesis benchmarks, as well as on multi-view stereo and video depth estimation.
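
To make the idea of a raymap concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics K and a camera-to-world pose, and encodes each pixel as a ray origin plus a unit ray direction in world coordinates. The abstract does not specify the exact parameterization used by MVGD, so the 6-channel layout and the function name `compute_raymap` are illustrative assumptions.

```python
import numpy as np

def compute_raymap(K, cam_to_world, height, width):
    """Hypothetical helper: per-pixel ray origins and unit directions, shape (H, W, 6).

    Assumes a pinhole camera; K is the 3x3 intrinsics matrix and
    cam_to_world is a 4x4 camera-to-world transform.
    """
    # Pixel-center coordinates (u along width, v along height).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-frame directions: d_cam = K^{-1} [u, v, 1]^T.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into the world frame and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray of one camera shares the same origin: the camera center.
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

# Example with made-up intrinsics and a camera translated to (1, 2, 3).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
raymap = compute_raymap(K, pose, height=48, width=64)
```

Because a raymap depends only on camera geometry, it can be computed in the same way for the input views and for the target view, which has no image. This is what lets one representation serve both roles described above: spatial context for observed features and a query for the novel viewpoint.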
