Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus

TL;DR

MVGD addresses generalizable novel view synthesis and depth estimation from sparse posed images by learning a single diffusion model that jointly generates RGB images and depth maps from arbitrary numbers of input views without intermediate 3D representations. It employs raymap conditioning and scene scale normalization within a Transformer-based RIN architecture, guided by learnable task embeddings to enable unified multi-task diffusion. The approach achieves state-of-the-art results on multiple novel view benchmarks and excels in multi-view depth estimation (e.g., ScanNet), while presenting an efficient incremental fine-tuning strategy that scales model capacity without retraining from scratch. This work advances practical 3D understanding from multi-view imagery and offers scalable training and conditioning strategies for large heterogeneous datasets.

Abstract

Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations, such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning both to augment visual features with spatial information from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results on multiple novel view synthesis benchmarks, as well as on multi-view stereo and video depth estimation.
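To make the idea of raymap conditioning concrete, the sketch below builds a raymap for a single pinhole camera: each pixel is assigned the ray origin (the camera center) and a unit viewing direction in world coordinates. This is an illustration under stated assumptions, not the authors' implementation. The function name `make_raymap`, the pinhole parameters `fx, fy, cx, cy`, the camera-to-world pose convention, and the half-pixel offset are all hypothetical choices; the abstract does not specify how MVGD parameterizes or encodes its raymaps.

```python
import math


def make_raymap(fx, fy, cx, cy, rotation, translation, height, width):
    """Return a height x width grid of (origin, direction) pairs in world space.

    Assumptions (not taken from the paper): a pinhole camera, `rotation` is a
    3x3 camera-to-world rotation given as a list of rows, and `translation`
    is the camera center expressed in world coordinates.
    """
    origin = tuple(translation)
    raymap = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center through the pinhole model.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction from camera to world coordinates.
            d_world = [
                sum(rotation[i][j] * d_cam[j] for j in range(3))
                for i in range(3)
            ]
            # Normalize so every ray direction has unit length.
            norm = math.sqrt(sum(c * c for c in d_world))
            row.append((origin, tuple(c / norm for c in d_world)))
        raymap.append(row)
    return raymap


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = make_raymap(2.0, 2.0, 1.0, 1.0, identity, (0.0, 0.0, 0.0), 2, 2)
```

Under these assumptions, the same routine applies to conditioning cameras and to the target camera alike: a conditioning view would pair its raymap with image features, while the target view's raymap alone tells the model which rays to generate color or depth for.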

Paper Structure

This paper contains 25 sections, 3 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: MVGD is a state-of-the-art method that generates images and scale-consistent depth maps from novel viewpoints given an arbitrary number of posed input views. In the above, red cameras are used as conditioning to directly generate RGB-D predictions from green cameras. To highlight the multi-view consistency of our method, predicted colored pointclouds from all novel viewpoints are stacked together for visualization without any post-processing. More examples and videos can be found in https://mvgd.github.io/
  • Figure 2: Diagram of our proposed Multi-View Geometric Diffusion (MVGD) framework, at inference time. $N$ input images $\mathbf{I}_c^n$ with cameras $\mathcal{C}_c^n$ are used for scene conditioning, and a different camera $\mathcal{C}_t$ is selected for novel view and depth synthesis.
  • Figure 3: MVGD novel view and depth synthesis results randomly sampled from different evaluation benchmarks and in-the-wild datasets. Top images are conditioning views (colored cameras), and bottom images are the target view (black camera), showing from left-to-right: ground-truth image, predicted image, and predicted depth map. These predictions are used to produce a colored 3D pointcloud observed from the target view. For more examples and additional visualizations, please refer to the supplementary material.
  • Figure 4: Zero-Shot MVGD novel view and depth synthesis results randomly sampled from different evaluation benchmarks and in-the-wild datasets. Top left images are conditioning views (colored cameras), and bottom images are the target view (black camera), showing from left-to-right: ground-truth image, predicted image, and predicted depth map. These predictions are used to produce a colored 3D pointcloud observed from the target viewpoint.
  • Figure 5: Accumulated MVGD pointclouds, obtained by generating novel images and depth maps from various viewpoints (black cameras), using the same conditioning views (colored cameras), and stacking them together without any post-processing. Our zero-shot architecture is capable of directly generating multi-view consistent predictions that match the scale from conditioning cameras.
  • ...and 4 more figures
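
The captions for Figures 1 and 5 describe stacking predictions from several novel viewpoints into one pointcloud without post-processing. A minimal sketch of that accumulation step follows, assuming pinhole cameras, camera-to-world poses, and depth maps that store z-depth along the optical axis. The helper names `unproject_depth` and `stack_pointclouds` and all of their parameters are hypothetical; the captions do not state which depth convention or camera model MVGD uses.

```python
def unproject_depth(depth, fx, fy, cx, cy, rotation, translation):
    """Lift a depth map (list of rows of z-depths) to world-space 3D points.

    Assumptions (not taken from the paper): pinhole intrinsics, `rotation`
    is a 3x3 camera-to-world rotation given as a list of rows, and
    `translation` is the camera center in world coordinates.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Point in camera coordinates, using the pixel center.
            p_cam = (z * (u + 0.5 - cx) / fx, z * (v + 0.5 - cy) / fy, z)
            # Rigid transform from camera to world coordinates.
            p_world = tuple(
                sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            points.append(p_world)
    return points


def stack_pointclouds(views):
    """Concatenate the unprojected points of every view, with no filtering."""
    cloud = []
    for view in views:
        cloud.extend(unproject_depth(**view))
    return cloud


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
views = [
    dict(depth=[[2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5,
         rotation=identity, translation=(1.0, 0.0, 0.0)),
    dict(depth=[[3.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5,
         rotation=identity, translation=(0.0, 0.0, 0.0)),
]
cloud = stack_pointclouds(views)
```

Plain concatenation only yields a coherent cloud when the per-view depth maps share one metric scale, which is why the scale consistency highlighted in the captions matters for this kind of visualization.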