MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

Emmanuelle Bourigault, Pauline Bourigault

TL;DR

MVDiff tackles the problem of inconsistent multi-view synthesis for single-view 3D reconstruction by combining a scene representation transformer with a view-conditioned latent diffusion model guided by epipolar geometry. The approach enforces 3D consistency via epipolar-attention and multi-view self-attention, enabling generation of multiple coherent views and subsequent 3D mesh reconstruction from few inputs. Key contributions include implicit 3D representation learning with geometric cues, a scalable diffusion-based framework, and demonstrated improvements on the GSO dataset in both novel view synthesis (PSNR, SSIM, LPIPS) and 3D reconstruction metrics. The work offers a practical pathway to fast, view-consistent 3D generation from minimal input, with potential applications in VR/AR, robotics, and synthetic data generation for medical visualization.
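
The epipolar-attention idea mentioned above rests on a standard geometric fact: a pixel in one view can only correspond to points on a single line (the epipolar line) in another view. The sketch below is not the authors' implementation. It is a minimal NumPy illustration, under the assumption of pinhole cameras with known intrinsics and relative pose, of how such a constraint could be turned into a mask that restricts which target-view locations a source pixel attends to. The function names, the pose convention `X_tgt = R @ X_src + t`, and the `threshold` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x of a 3-vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K_src, K_tgt, R, t):
    """Fundamental matrix F with x_tgt^T F x_src = 0.

    Assumed convention (not from the paper): a 3D point in source-camera
    coordinates maps to target-camera coordinates as X_tgt = R @ X_src + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K_tgt).T @ E @ np.linalg.inv(K_src)

def epipolar_mask(F, src_px, height, width, threshold=1.5):
    """Boolean (height, width) mask over target-view pixels.

    True where a target pixel lies within `threshold` pixels of the
    epipolar line induced by the source pixel src_px = (u, v).
    """
    # Line coefficients (a, b, c) of a*x + b*y + c = 0 in the target image.
    a, b, c = F @ np.array([src_px[0], src_px[1], 1.0])
    ys, xs = np.mgrid[0:height, 0:width]
    # Point-to-line distance; undefined if a = b = 0 (degenerate geometry).
    dist = np.abs(a * xs + b * ys + c) / np.hypot(a, b)
    return dist <= threshold
```

In an attention layer, a mask of this kind would typically be applied to the attention logits so that off-line locations receive little or no weight; how MVDiff actually weights or masks attention is not specified in this summary.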

Abstract

Generating consistent multiple views for 3D reconstruction remains a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into diffusion models decreases the model's speed as well as its generalizability and quality. This paper proposes a general framework for generating consistent multi-view images from a single image or a few images by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one input image, our model is able to generate 3D meshes that surpass baseline methods in evaluation metrics including PSNR, SSIM and LPIPS.
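
Of the three image metrics named above, PSNR is the simplest to state: it is a log-scaled mean squared error between a generated view and its ground-truth image. The following is a minimal sketch of the standard definition, not the evaluation code used in the paper. The function name `psnr` and the `max_val` parameter (the peak pixel value, assumed here to be 1.0 for images scaled to [0, 1]) are assumptions of this example. SSIM and LPIPS are not shown, since they require a windowed structural comparison and a pretrained network respectively.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Higher PSNR indicates a closer pixel-wise match to the reference view.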

Paper Structure (23 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Pipeline of MVDiff. From a single input image or a few input images, the transformer encoder translates the image(s) into latent scene representations, implicitly capturing 3D information. The intermediate outputs from the scene representation transformer are used as input by the view-conditioned latent diffusion UNet, which generates multi-view consistent images from varying viewpoints.
  • Figure 2: Zero-Shot Novel View Synthesis on GSO. MVDiff outperforms Zero123-XL for single-view generation, with greater camera control and generation quality. As more views are added, MVDiff's output more closely resembles the ground truth, capturing fine details such as the elephant's tail and the turtle's shell design.
  • Figure 3: Diversity of Novel View Diffusion with MVDiff on NeRF-Synthetic Dataset. We show nearby views (top and bottom row) displaying good consistency, while more distant views (middle) are more diverse but still realistic.
  • Figure 4: 3D reconstruction from a single view on the GSO dataset. MVDiff produces consistent novel views and improves the 3D geometry compared to baselines. One-2-3-45 and SyncDreamer tend to generate overly smoothed and incomplete 3D objects, in particular the sofa. EscherNet recovers more of the finer details, as with the hat.