VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang, Yuan Liu, Ziwei Liu, Wenping Wang, Zhen Dong, Bisheng Yang

TL;DR

Experimental results demonstrate that without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis using just single-view images and outperforms baseline methods by a large margin.

Abstract

In this paper, we propose VistaDream, a novel framework for reconstructing a 3D scene from a single-view image. Recent diffusion models can generate high-quality novel-view images from a single-view input image. However, most existing methods concentrate only on the consistency between the input image and the generated images, and lose the consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out a small step from the input view, using inpainted boundaries and an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes of the scaffold. In the second stage, we further enhance the consistency among the generated novel-view images with a novel training-free Multiview Consistency Sampling (MCS), which introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis from just a single-view image and outperforms baseline methods by a large margin. The code, videos, and interactive demos are available at https://vistadream-project-page.github.io/.
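
To make the "holes of the scaffold" in the first stage concrete, the following is a minimal sketch of how warping an RGB-D image to a new viewpoint exposes regions that need inpainting. It is an illustration under stated assumptions, not the authors' implementation: the function name `warp_rgbd`, the pinhole camera with the principal point at the image center, the pure sideways translation `shift_x`, and the nearest-pixel splatting are all hypothetical choices made for brevity.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[float, float, float]


def warp_rgbd(
    rgb: List[List[Pixel]],
    depth: List[List[float]],
    focal: float,
    shift_x: float,
) -> Tuple[List[List[Optional[Pixel]]], List[List[bool]]]:
    """Forward-warp an RGB-D image to a camera translated by shift_x along +x.

    Assumes a pinhole camera with the principal point at the image center
    (hypothetical setup). Returns the warped image and a hole mask that is
    True wherever no source pixel landed, i.e. where inpainting is needed.
    """
    h, w = len(rgb), len(rgb[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    warped: List[List[Optional[Pixel]]] = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            # Unproject pixel (u, v) to a 3D point in the source camera frame.
            x = (u - cx) * z / focal
            y = (v - cy) * z / focal
            # Express the point in the target camera frame (pure translation,
            # so depth z is unchanged).
            x_t = x - shift_x
            # Project into the target image and snap to the nearest pixel.
            u_t = round(focal * x_t / z + cx)
            v_t = round(focal * y / z + cy)
            # Z-buffer test: keep the closest surface per target pixel.
            if 0 <= u_t < w and 0 <= v_t < h and z < zbuf[v_t][u_t]:
                zbuf[v_t][u_t] = z
                warped[v_t][u_t] = rgb[v][u]
    holes = [[warped[v][u] is None for u in range(w)] for v in range(h)]
    return warped, holes
```

In this toy setup, moving the camera to the right shifts the scene content left in the image, so the hole mask marks the right edge. In a pipeline like the one described above, such a mask would be handed to an RGB-D inpainting model, which is outside the scope of this sketch.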

Paper Structure

This paper contains 19 sections, 3 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Overview. (Top) Given a single-view image of a scene, VistaDream reconstructs a 3D scene represented by 3D Gaussian Splatting (3DGS) [gs] for novel view synthesis. (Bottom) The proposed Multiview Consistency Sampling (MCS) significantly improves scene quality and achieves better results than the commonly used Score Distillation Sampling (SDS) [sds].
  • Figure 2: Stage I: Coarse Gaussian field reconstruction. (a) Given an image, VistaDream initializes a global 3D scaffold by enlarging the field of view (FoV) and inpainting, then iteratively inpaints the warped RGB-D images to complete a coarse Gaussian field. (b) Without a scaffold, existing models struggle to accurately connect the inpainted regions with the global scene, leading to distortion. A global scaffold provides a reliable constraint across different viewpoints, yielding correct connections between the inpainted areas and the scaffold.
  • Figure 3: Multiview Consistency Sampling for Scene Refinement. (a) We optimize the Gaussian field using high-quality, multi-view images regenerated by diffusion models. (b) The key component is the MCS algorithm, which enforces consistency during multi-view optimization. (c) A real case demonstrates that the MCS optimization process progressively enhances the quality (red box) and consistency (yellow box) of the multi-view images. Optimizing the Gaussian field with multi-view images from MCS can significantly enhance its quality.
  • Figure 4: Detailed description is vital for inpainting. Compared to (b) empty descriptions or (c) short captions, (d) descriptions from large Vision-Language Models are more detailed, significantly enhancing the reliability of inpainting.
  • Figure 5: Qualitative comparisons between RealmDreamer [realmdreamer] and our method. Given (a) a single input image, both (b) RealmDreamer and (c) VistaDream (Ours) reconstruct the corresponding 3D Gaussian field through a coarse-to-fine strategy. In the third column of each method, we visualize a mixture of rendered images and depth maps of the scene.
  • ...and 11 more figures
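
The idea of adding multi-view consistency constraints to the reverse sampling process (Figure 3) can be illustrated with a toy sketch. This is not the MCS algorithm itself, whose details are not given in this excerpt. It assumes a deterministic DDIM-style update, treats each view as a flat list of floats, and uses a cross-view mean as a stand-in for whatever shared 3D representation enforces consistency in practice. The names `rectify`, `consistent_sampling`, `predict_noise`, and `weight` are hypothetical.

```python
import math
from typing import Callable, List

View = List[float]  # a flattened toy "image"
NoiseFn = Callable[[int, View, float], View]  # (view index, x_t, alpha_bar_t) -> eps


def rectify(x0_preds: List[View], weight: float) -> List[View]:
    """Pull each view's clean-image estimate toward the cross-view mean.

    The mean is only a placeholder for a real consistency constraint.
    """
    n = len(x0_preds)
    size = len(x0_preds[0])
    mean = [sum(p[i] for p in x0_preds) / n for i in range(size)]
    return [[(1 - weight) * p[i] + weight * mean[i] for i in range(size)]
            for p in x0_preds]


def consistent_sampling(
    x_T: List[View],
    alphas: List[float],
    predict_noise: NoiseFn,
    weight: float = 1.0,
) -> List[View]:
    """Deterministic DDIM-style sampling with a consistency step per iteration.

    alphas holds cumulative alpha-bar values ordered from noisy to clean;
    every entry except the last must lie strictly between 0 and 1.
    """
    xs = [list(x) for x in x_T]
    for step in range(len(alphas) - 1):
        a_t, a_prev = alphas[step], alphas[step + 1]
        # 1. Per-view estimate of the clean image from the predicted noise.
        x0_preds = []
        for k, x in enumerate(xs):
            eps = predict_noise(k, x, a_t)
            x0_preds.append([(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t)
                             for xi, ei in zip(x, eps)])
        # 2. Rectify the estimates so that the views agree with each other.
        x0_preds = rectify(x0_preds, weight)
        # 3. Recompute the noise implied by the rectified estimate, then step.
        nxt = []
        for x, x0 in zip(xs, x0_preds):
            eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1 - a_t)
                   for xi, x0i in zip(x, x0)]
            nxt.append([math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
                        for x0i, ei in zip(x0, eps)])
        xs = nxt
    return xs
```

With `weight=0.0` the loop reduces to independent per-view DDIM sampling, so any disagreement between the per-view denoisers survives into the outputs. With `weight=1.0` the clean estimates are forced to agree at every step, which is the behavior the consistency constraint is meant to encourage.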