MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation
Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani
TL;DR
MVD-Fusion addresses single-view 3D inference by directly generating multiple views that are depth-consistent, avoiding distillation-based postprocessing. It builds a depth-guided multi-view diffusion framework that leverages depth to enforce cross-view coherence via depth-aware attention and a 2.5D representation. The approach yields improved novel-view synthesis and competitive 3D geometry across Objaverse, Google Scanned Objects, and CO3D, with demonstrated diversity and zero-shot generalization to in-the-wild objects. This work offers a practical pathway for fast, multi-view consistent 3D inference from a single image, enabling downstream applications in AR/VR and robotics.
Abstract
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent multiple views and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset comprising generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis than recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches.
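The geometric idea behind reprojection-based conditioning can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how a per-pixel depth estimate in one view determines where each pixel lands in another view under a standard pinhole camera model. The function name `reproject_pixels`, the shared intrinsics `K`, and the relative pose `(R, t)` are assumptions introduced for illustration, and depth is assumed to be strictly positive.

```python
import numpy as np

def reproject_pixels(depth, K, R, t):
    """Map each pixel of view A to pixel coordinates in view B.

    depth: (H, W) positive depth along the camera z-axis in view A.
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) taking view-A camera
           coordinates to view-B camera coordinates (hypothetical inputs).
    Returns an (H, W, 2) array of (u, v) pixel coordinates in view B.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T    # back-project pixels to rays with z = 1
    pts_a = rays * depth[..., None]    # 3D points in the view-A camera frame
    pts_b = pts_a @ R.T + t            # the same points in the view-B frame
    proj = pts_b @ K.T                 # apply intrinsics
    return proj[..., :2] / proj[..., 2:3]  # perspective divide


# Usage: a flat surface at depth 2, with view B offset along the x-axis.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
coords = reproject_pixels(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In a pipeline like the one described, such correspondences could be used to sample features from other views at the reprojected locations; how the noisy intermediate depth is actually used inside the diffusion model is not specified in this abstract.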
