MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, Yanwei Fu

TL;DR

MVGenMaster tackles the challenge of versatile novel view synthesis by fusing diffusion-based generation with explicit 3D priors derived from metric depth and camera geometry. The method employs a multi-view latent diffusion model with Plücker ray embeddings and warped 3D priors (RGB pixels and CCMs) to synthesize many target views from arbitrary reference views in a single forward pass, anchored by a large-scale MvD-1M dataset. Key innovations include a training-free key-rescaling mechanism to extend view numbers without degradation and targeted training strategies (domain switcher, multi-scale training, EMA) that boost scalability and generalization. Empirical results across in-domain and out-of-domain benchmarks demonstrate state-of-the-art NVS performance with improved 3D consistency, extending practical NVS capabilities toward scene-level content and variable-view generation for applications in graphics and AR/VR.
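The Plücker ray embeddings mentioned above encode each pixel's viewing ray as a 6-vector. The sketch below is only illustrative: it assumes a pinhole intrinsics matrix `K`, a 4x4 camera-to-world pose `c2w`, and the ordering (direction, moment). The function name `plucker_rays` and these conventions are assumptions, not details taken from the paper.

```python
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Per-pixel Plücker ray embedding (d, o x d) with shape (H, W, 6).

    K:   3x3 pinhole intrinsics in pixels (assumed convention).
    c2w: 4x4 camera-to-world matrix (assumed convention).
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # The moment o x d captures where the ray sits in space.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)
```

In a diffusion pipeline such a map would typically be resized to the latent resolution and concatenated with the other per-view inputs; how MVGenMaster does this is not specified in this summary.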

Abstract

We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses with a single forward pass. Additionally, we have developed a comprehensive large-scale multi-view image dataset called MvD-1M, comprising up to 1.6 million scenes, equipped with well-aligned metric depth to train MVGenMaster. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and code will be released at https://github.com/ewrfcas/MVGenMaster/.
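To make "3D priors that are warped using metric depth and camera poses" concrete, here is a minimal forward-warping sketch. It assumes shared pinhole intrinsics, z-depth, nearest-pixel splatting, and a simple nearest-point-wins rule. The function `warp_to_target` and the unnormalized world-coordinate map standing in for the CCM are hypothetical simplifications, not the authors' implementation.

```python
import numpy as np

def warp_to_target(rgb, depth, K, ref_c2w, tgt_w2c):
    """Forward-warp reference RGB and 3D coordinates into a target view.

    rgb: (H, W, 3); depth: (H, W) metric z-depth; K: 3x3 shared intrinsics;
    ref_c2w / tgt_w2c: 4x4 poses. All conventions here are assumptions.
    Returns warped RGB, a world-space coordinate map, and a validity mask.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Unproject reference pixels to camera space, then to world space.
    pts_cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    pts_world = pts_cam @ ref_c2w[:3, :3].T + ref_c2w[:3, 3]

    # Project world points into the target camera.
    pts_tgt = pts_world @ tgt_w2c[:3, :3].T + tgt_w2c[:3, 3]
    z = pts_tgt[:, 2]
    front = z > 1e-6
    proj = pts_tgt[front] @ K.T
    x = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    y = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (x >= 0) & (x < W) & (y >= 0) & (y < H)

    # Keep only the nearest source point for each target pixel.
    idx = np.flatnonzero(front)[inside]
    flat = y[inside] * W + x[inside]
    order = np.argsort(z[idx], kind="stable")  # near to far
    idx, flat = idx[order], flat[order]
    flat, first = np.unique(flat, return_index=True)
    idx = idx[first]

    warped_rgb = np.zeros((H, W, 3), dtype=rgb.dtype)
    coord_map = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    warped_rgb.reshape(-1, 3)[flat] = rgb.reshape(-1, 3)[idx]
    coord_map.reshape(-1, 3)[flat] = pts_world[idx]
    mask.reshape(-1)[flat] = True
    return warped_rgb, coord_map, mask
```

Pixels left unset in `mask` are the holes that the diffusion model would have to fill in; when the depth is inaccurate, the warped prior becomes a noisy condition rather than ground truth.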

Paper Structure

This paper contains 20 sections, 5 equations, 16 figures, 14 tables.

Figures (16)

  • Figure 1: The proposed MVGenMaster handles a range of NVS scenarios, including (a) NVS based on single-view text-to-image conditions, (b) interpolation between two known views, and (c) flexible NVS with variable reference views and arbitrary target views. MVGenMaster performs all of these tasks in a single forward pass, without complex iterative inference or dataset updating.
  • Figure 2: Overall pipeline of MVGenMaster. Inputs can be categorized into reference views (reference images and related camera poses) and target views (camera poses only). For training, we extract monocular depths from reference views and then align them with SfM to warp CCM and RGB pixels as 3D priors for target views. For inference, we utilize Depth-Pro (Bochkovskii et al., 2024) for single-view input or Dust3R (Wang et al., 2024) for multi-view input to obtain metric depth.
  • Figure 3: The metric depth alignment process for the training data of MVGenMaster. We estimate the scale and shift coefficients with RANSAC, then use them to align the monocular depth to metric depth with a simple linear transformation.
  • Figure 4: Key-rescaling. We employ 3-view references with ambiguous 3D priors as noisy conditions for long sequential generation. Key-rescaling enhances the reference guidance and eliminates attention dilution, resulting in better NVS when generating many target views.
  • Figure 5: Qualitative NVS results compared among CAT3D*, ViewCrafter, and our MVGenMaster. The synthesis uses $N=1$ reference view and $M=24$ target views. The leftmost column displays the reference view, while the remaining visualizations are uniformly sampled from the 24-frame generation because of page limits.
  • ...and 11 more figures
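
The depth alignment described in the Figure 3 caption amounts to fitting metric ≈ scale · mono + shift robustly. The following is a minimal sketch of that idea, assuming paired depth samples at the same pixels (for example, sparse SfM points). The function name `ransac_scale_shift` and the parameters `iters`, `thresh`, and `seed` are hypothetical choices, not values from the paper.

```python
import numpy as np

def ransac_scale_shift(mono, metric, iters=200, thresh=0.05, seed=0):
    """Robustly fit metric ~= scale * mono + shift.

    mono, metric: 1-D arrays of depths sampled at the same pixels.
    iters and thresh are illustrative defaults (assumptions).
    """
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        # Minimal sample: two points determine a line.
        i, j = rng.choice(mono.size, size=2, replace=False)
        if mono[i] == mono[j]:
            continue
        scale = (metric[i] - metric[j]) / (mono[i] - mono[j])
        shift = metric[i] - scale * mono[i]
        inliers = np.abs(scale * mono + shift - metric) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no valid minimal sample found")

    # Refine with least squares on the best inlier set.
    A = np.stack([mono[best_inliers], np.ones(best_inliers.sum())], axis=1)
    scale, shift = np.linalg.lstsq(A, metric[best_inliers], rcond=None)[0]
    return scale, shift
```

The aligned depth map is then `scale * mono_depth + shift` applied to every pixel, which is the linear transformation the caption refers to.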