MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani

TL;DR

MVD-Fusion addresses single-view 3D inference by directly generating multiple views that are depth-consistent, avoiding distillation-based postprocessing. It builds a depth-guided multi-view diffusion framework that leverages depth to enforce cross-view coherence via depth-aware attention and a 2.5D representation. The approach yields improved novel-view synthesis and competitive 3D geometry across Objaverse, Google Scanned Objects, and CO3D, with demonstrated diversity and zero-shot generalization to in-the-wild objects. This work offers a practical pathway for fast, multi-view consistent 3D inference from a single image, enabling downstream applications in AR/VR and robotics.

Abstract

We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually consistent multiple views and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset, which comprises generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis than recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches.
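The reprojection idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, which operates on noisy depth estimates inside a diffusion U-Net. It only shows the underlying geometry: a pixel with known depth in one view is lifted to 3D and projected into another view. The function name `reproject`, the pinhole camera model, the shared intrinsics `K`, and the 4x4 transform conventions are all assumptions made for illustration.

```python
import numpy as np

def reproject(pixels, depth, K, cam_a_to_world, world_to_cam_b):
    """Map pixels with known depth in view A to pixel coordinates in view B.

    Hypothetical helper; assumes a pinhole camera with intrinsics K shared
    by both views and z-depth measured along view A's optical axis.

    pixels: (N, 2) array of (u, v) pixel coordinates in view A
    depth:  (N,) array of depths in view A
    K:      (3, 3) intrinsics matrix
    cam_a_to_world, world_to_cam_b: (4, 4) rigid transforms
    """
    n = pixels.shape[0]
    ones = np.ones((n, 1))
    # Unproject: X_a = depth * K^{-1} [u, v, 1]^T
    rays = np.concatenate([pixels, ones], axis=1) @ np.linalg.inv(K).T
    points_a = rays * depth[:, None]
    # Move the points from camera A's frame into camera B's frame.
    a_to_b = world_to_cam_b @ cam_a_to_world
    points_b = (np.concatenate([points_a, ones], axis=1) @ a_to_b.T)[:, :3]
    # Perspective projection into view B.
    proj = points_b @ K.T
    uv_b = proj[:, :2] / proj[:, 2:3]
    return uv_b, points_b[:, 2]

# Usage: camera B sits one unit to the right of camera A.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
world_to_cam_b = np.eye(4)
world_to_cam_b[0, 3] = -1.0
uv_b, depth_b = reproject(np.array([[50.0, 50.0]]), np.array([2.0]),
                          K, np.eye(4), world_to_cam_b)
```

In a depth-guided attention setting, coordinates such as `uv_b` would indicate where to sample features from the other views, so each pixel attends to locations that are geometrically consistent with its current depth estimate.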

Paper Structure (24 sections, 5 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Single-view 3D Inference. Given an input RGB image, MVD-Fusion allows synthesizing multi-view RGB-D images using a depth-guided attention mechanism for enforcing multi-view consistency. We visualize the input RGB image (left) and three synthesized novel views (with generated depth in inset).
  • Figure 2: Approach Overview. MVD-Fusion learns a denoising diffusion model for generating multi-view RGB-D images given an input RGB image. At each diffusion timestep $t$, MVD-Fusion uses the current (noisy) depth estimates to compute depth-projection-based multi-view aware features (top). A novel-view diffusion U-Net is modified to leverage these multi-view aware features as additional conditioning while producing denoised estimates of both RGB and depth (bottom).
  • Figure 3: We visualize the unprojected point cloud obtained from a set of noisy RGB-D images at different timesteps during inference. We observe the gradual denoising of geometry from a random point cloud to a point cloud that matches the input object.
  • Figure 4: Qualitative results for novel view synthesis on instances from Objaverse (top) and Google Scanned Objects (bottom). We compare our method with Zero-1-to-3 [liu2023zero1to3] and SyncDreamer [liu2023syncdreamer]. We show the input image and two novel views generated by each method. Zero-1-to-3 generates novel views independently, so they are not consistent with one another (e.g., the person in Objaverse). While both SyncDreamer and MVD-Fusion yield consistent generations, we find that MVD-Fusion can generate more plausible output (e.g., the Android image) and is more faithful to details in the input (e.g., the three cars).
  • Figure 5: Sample Diversity. MVD-Fusion is capable of generating diverse samples given the same input. We show the input image (left) followed by views synthesized in three randomly generated samples. We observe meaningful variation in uncertain regions, e.g., the eyes of the character and the colors on the screen vary across samples.
  • ...and 3 more figures
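
The point-cloud visualization described in the Figure 3 caption relies on unprojecting each view's depth map into a shared world frame. The sketch below shows one way this could be done, assuming a pinhole camera with shared intrinsics and known camera-to-world poses. The helper names `unproject_depth_map` and `fuse_views` are hypothetical and are not taken from the paper.

```python
import numpy as np

def unproject_depth_map(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H*W, 3) array of world-space points.

    Assumes a pinhole camera and z-depth; pixel (u, v) is column u, row v.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=1)
    # Back-project each pixel along its ray, scaled by its depth.
    points_cam = (pixels @ np.linalg.inv(K).T) * depth.ravel()[:, None]
    points_h = np.concatenate([points_cam, np.ones((h * w, 1))], axis=1)
    return (points_h @ cam_to_world.T)[:, :3]

def fuse_views(depths, K, cams_to_world):
    """Concatenate the unprojected points of several views into one cloud."""
    clouds = [unproject_depth_map(d, K, T) for d, T in zip(depths, cams_to_world)]
    return np.concatenate(clouds, axis=0)

# Usage: two tiny 2x2 depth maps; the second camera is shifted along z.
K = np.eye(3)
shifted = np.eye(4)
shifted[2, 3] = 1.0
cloud = fuse_views([np.ones((2, 2)), np.ones((2, 2))], K, [np.eye(4), shifted])
```

Applied to depth maps taken at successive diffusion timesteps, a routine like this would produce the kind of progressively cleaner point cloud that the caption describes.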