MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View
Emmanuelle Bourigault, Pauline Bourigault
TL;DR
MVDiff tackles the problem of inconsistent multi-view synthesis for single-view 3D reconstruction by combining a scene representation transformer with a view-conditioned latent diffusion model guided by epipolar geometry. The approach enforces 3D consistency via epipolar-attention and multi-view self-attention, enabling generation of multiple coherent views and subsequent 3D mesh reconstruction from few inputs. Key contributions include implicit 3D representation learning with geometric cues, a scalable diffusion-based framework, and demonstrated improvements on the GSO dataset in both novel view synthesis (PSNR, SSIM, LPIPS) and 3D reconstruction metrics. The work offers a practical pathway to fast, view-consistent 3D generation from minimal input, with potential applications in VR/AR, robotics, and synthetic data generation for medical visualization.
Abstract
Generating consistent multiple views for 3D reconstruction remains a challenge for existing image-to-3D diffusion models. In general, incorporating 3D representations into a diffusion model reduces its speed, generalizability, and quality. This paper proposes a general framework for generating consistent multi-view images from a single image or a few images, leveraging a scene representation transformer and a view-conditioned diffusion model. We introduce epipolar geometry constraints and multi-view attention into the model to enforce 3D consistency. From as few as one input image, our model generates 3D meshes that surpass baseline methods on evaluation metrics including PSNR, SSIM, and LPIPS.
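To make the idea of an epipolar constraint on attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes pinhole cameras with known intrinsics and a known relative pose, and it uses a hard distance threshold in pixel space; the abstract does not specify how the constraint is applied, so the function names (`fundamental_matrix`, `epipolar_mask`, `masked_attention`), the `threshold` parameter, and the hard-mask design are all illustrative assumptions.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_matrix(K_src, K_tgt, R, t):
    """F with x_tgt^T F x_src = 0, assuming X_tgt = R @ X_src + t."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K_tgt).T @ E @ np.linalg.inv(K_src)


def pixel_grid(h, w):
    """Homogeneous pixel coordinates (u, v, 1), shape (h*w, 3), row-major."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1).astype(float)


def epipolar_mask(F, h, w, threshold=1.0):
    """mask[i, j] is True when source pixel j lies within `threshold`
    pixels of the epipolar line induced by target pixel i (assumed design)."""
    pts = pixel_grid(h, w)
    lines = pts @ F  # row i is the line F^T x_tgt_i in the source image
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ pts.T) / np.maximum(norm, 1e-12)
    return dist <= threshold


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to entries where mask is True.
    Masked scores get a large negative value; a fully masked row would
    therefore fall back to uniform weights rather than produce NaNs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

For a pure sideways translation between the two cameras, the epipolar lines are horizontal, so each target pixel attends only to source pixels in its own image row. A learned model would typically apply such a constraint to feature maps at reduced resolution, and may use a soft weighting instead of the hard mask assumed here.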
