MultiDiff: Consistent Novel View Synthesis from a Single Image

Norman Müller, Katja Schwarz, Barbara Roessle, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, Peter Kontschieder

TL;DR

A novel approach for consistent novel view synthesis of scenes from a single RGB image, which incorporates strong priors in the form of monocular depth predictors and video-diffusion models, and naturally supports multi-view consistent editing without further tuning.

Abstract

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results -- even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
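
The depth-warping step described in the abstract is easiest to see in code. Below is a minimal sketch, assuming a pinhole camera with intrinsics K and a reference-to-target rigid pose (R, t); the function name `warp_reference` and its interface are illustrative assumptions, not the paper's code. It forward-splats reference pixels into the target view using the monocular depth map, leaving disoccluded regions empty, which is exactly where a generative prior is needed.

```python
# Minimal sketch (not the authors' code) of depth-based image warping:
# unproject reference pixels with monocular depth, move them into the
# target camera, and splat them onto the target image plane.
import torch

def warp_reference(img, depth, K, R, t):
    """img: (3,H,W) reference image; depth: (H,W) monocular depth;
    K: (3,3) pinhole intrinsics; (R, t): reference-to-target pose."""
    _, H, W = img.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(3, -1).float()
    # Unproject to 3D in the reference camera, transform to the target.
    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = R @ pts + t[:, None]
    # Reproject; keep points in front of the camera and inside the image.
    proj = K @ pts
    uv = (proj[:2] / proj[2:].clamp(min=1e-6)).round().long()
    ok = (proj[2] > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    # Simple forward splatting: no z-buffering, and disoccluded target
    # pixels stay empty (these are the holes the diffusion model fills).
    out = torch.zeros_like(img)
    out.reshape(3, -1)[:, uv[1, ok] * W + uv[0, ok]] = img.reshape(3, -1)[:, ok]
    return out
```

In the full model, these warped images are not used as outputs directly; they are encoded and injected into the denoising U-Net as conditioning (see Figure 2).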

Paper Structure

This paper contains 38 sections, 2 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Given a single input image, MultiDiff synthesizes consistent novel views following a desired camera trajectory. These synthesized views harmonize well even in areas unseen from the reference view. Examples from the RealEstate10K [realestate46965] (top two rows) and ScanNet [dai2017scannet] (bottom row) test sets demonstrate that our model can handle large camera changes and challenging perspectives.
  • Figure 2: MultiDiff is a pose-conditioned diffusion model for novel view synthesis from a single image. The diffusion model is trained in the latent space of a fixed auto-encoder with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ and is conditioned on a reference image $\mathbf{I}_\text{ref}$ and a camera trajectory $\{\mathbf{c}^n\}$. Specifically, we embed $N$ posed target images $\{\mathbf{I}^n\}_{n=1}^N$ into latent space, apply forward diffusion according to a timestep $t$ and structured noise $\{\boldsymbol{\xi}^n\}$, and train a 3D U-Net to predict $\{\boldsymbol{\xi}^n\}$ from the noisy inputs $\{\mathbf{z}_t^n\}$. For each sample $n$, the U-Net’s prediction $\hat{\boldsymbol{\xi}}_t^n$ is used to reconstruct the denoised sample $\hat{\mathbf{z}}_t^n$, which can then be decoded into the predicted target image $\hat{\mathbf{I}}^n$. We condition the U-Net on the reference image by warping $\mathbf{I}_\text{ref}$ to the novel views using depth $\hat{\mathbf{D}}_\text{ref}$ from a pretrained estimator $\phi$. The warps $\{\mathbf{I}_\text{ref}^n\}$ are encoded into latent representations $\{\mathbf{y}_\text{tgt}^n\}$ and injected into the U-Net in a ControlNet-inspired manner. We further condition the model directly on the camera pose and an embedding of the reference image as part of the semantic condition $\{\mathbf{y}_\text{sem}^n\}$. (A schematic training step in code follows this figure list.)
  • Figure 3: Novel views following ground-truth trajectories (right) given a reference view (left) on RealEstate10K. Through our joint multi-frame prediction combined with effective priors and conditioning, our sequence of novel views is highly realistic and view-consistent compared to the baselines, which show severe degradation over time.
  • Figure 4: Generated views along a ScanNet [dai2017scannet] test sequence (right) given a reference view (left). Our method simultaneously generates sequences of novel views that are both more realistic and more view-consistent than the baselines, DFM and PhotoNVS, which suffer from a considerable performance drop across large viewpoint changes. Although MVDiffusion uses sensor depth as input, its generated views are much less consistent with the reference image (e.g., the colors of the cushions) than ours, which do not rely on sensor depth.
  • Figure 5: Without structured noise ("MultiDiff w/o SN"), the color of the dining table is not preserved relative to the reference image; step 2 of the training-step sketch after this list illustrates one construction of structured noise.
  • ...and 10 more figures
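
To tie the Figure 2 caption together, here is a schematic training step. Everything below is a sketch under assumed interfaces: `encoder`, `unet3d`, `depth_net`, and the batched warp helper `warp` (a batched variant of the warping sketch above) are hypothetical stand-ins passed as arguments, not the released MultiDiff API, and the structured-noise construction, warping one reference-view noise map into every target view, is one plausible reading of the caption and the Figure 5 ablation.

```python
# Schematic training step following the Figure 2 description. Module
# interfaces (encoder, unet3d, depth_net) and the batched warp helper
# are illustrative assumptions, not the released MultiDiff API.
import torch
import torch.nn.functional as F

def training_step(encoder, unet3d, depth_net, warp,
                  I_ref, I_tgt, cams, alphas_cumprod):
    """I_ref: (B,3,H,W); I_tgt: (B,N,3,H,W); cams: (B,N,...) camera
    poses; alphas_cumprod: (T,) cumulative alphas of the schedule."""
    B, N = I_tgt.shape[:2]
    # 1) Embed the N posed target images into the latent space of the
    #    fixed auto-encoder.
    z0 = encoder(I_tgt.flatten(0, 1)).unflatten(0, (B, N))
    # 2) Structured noise: warp one reference-view noise map into every
    #    target view so noise is pixel-correlated across frames (depth is
    #    assumed to be resized to latent resolution inside warp; a fuller
    #    version would refill disoccluded pixels with fresh noise).
    D_ref = depth_net(I_ref)
    xi_ref = torch.randn_like(z0[:, 0])
    xi = torch.stack([warp(xi_ref, D_ref, cams[:, n]) for n in range(N)], 1)
    # 3) Forward diffusion at a shared random timestep t.
    t = torch.randint(0, alphas_cumprod.shape[0], (B,))
    a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * xi
    # 4) Depth-warp the reference image to each target view and encode;
    #    these latents condition the U-Net ControlNet-style, alongside
    #    camera poses and a reference embedding (the semantic condition).
    warps = torch.stack([warp(I_ref, D_ref, cams[:, n]) for n in range(N)], 1)
    y_tgt = encoder(warps.flatten(0, 1)).unflatten(0, (B, N))
    # 5) The 3D U-Net jointly predicts the noise for all N frames; the
    #    loss matches the prediction to the structured noise.
    xi_hat = unet3d(z_t, t, control=y_tgt, cams=cams)
    return F.mse_loss(xi_hat, xi)
```

At inference time the same conditioning path applies, but the latents start from structured noise and all N frames are denoised jointly, which is what lets the method avoid the drift and error accumulation of autoregressive baselines.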