MVD$^2$: Efficient Multiview 3D Reconstruction for Multiview Diffusion

Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, Yang Liu

TL;DR

Multiview diffusion enables 3D generation from prompts but reconstructing accurate geometry from sparse, inconsistent MVD views is challenging. The paper introduces MVD$^2$, a lightweight network that maps multi-view image features into a 3D feature volume and decodes a differentiable triangle mesh, guided by a view-dependent training scheme to handle view inconsistency. Trained on Zero-123++ and Objaverse-LVIS, MVD$^2$ delivers high-quality geometry with inference times under 1 second and demonstrates strong robustness across different MVD models and prompts, outperforming NeuS and other baselines. The approach generalizes to image- and text-conditioned MVD outputs and provides practical, scalable reconstruction for diverse MVD pipelines, with released code and models to support future research.

Abstract

As a promising 3D generation technique, multiview diffusion (MVD) has received a lot of attention due to its advantages in terms of generalizability, quality, and efficiency. By finetuning pretrained large image diffusion models with 3D data, the MVD methods first generate multiple views of a 3D object based on an image or text prompt and then reconstruct 3D shapes with multiview 3D reconstruction. However, the sparse views and inconsistent details in the generated images make 3D reconstruction challenging. We present MVD$^2$, an efficient 3D reconstruction method for MVD images. MVD$^2$ aggregates image features into a 3D feature volume by projection and convolution and then decodes volumetric features into a 3D mesh. We train MVD$^2$ with 3D shape collections and MVD images prompted by rendered views of 3D shapes. To address the discrepancy between the generated multiview images and ground-truth views of the 3D shapes, we design a simple-yet-efficient view-dependent training scheme. MVD$^2$ improves the 3D generation quality of MVD and is fast and robust to various MVD methods. After training, it can efficiently decode 3D meshes from multiview images within one second. We train MVD$^2$ with Zero-123++ and the Objaverse-LVIS 3D dataset and demonstrate its superior performance in generating 3D models from multiview images generated by different MVD methods, using both synthetic and real images as prompts.
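
To make the "aggregates image features into a 3D feature volume by projection" step concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `aggregate_features`, the 3x4 pinhole projection matrices, and the nearest-pixel sampling are all assumptions chosen for brevity, and the convolution stage mentioned in the abstract is omitted entirely.

```python
import numpy as np


def aggregate_features(grid_points, feature_maps, projections):
    """Average per-view image features at each 3D grid point.

    Hypothetical helper, not taken from the paper.
    grid_points:  (P, 3) world-space points.
    feature_maps: (V, H, W, C) one feature map per view.
    projections:  (V, 3, 4) matrices mapping homogeneous world points
                  to homogeneous pixel coordinates (assumed known).
    Returns (P, C) averaged features; points seen by no view get zeros.
    """
    num_views, height, width, channels = feature_maps.shape
    num_points = grid_points.shape[0]
    homogeneous = np.concatenate(
        [grid_points, np.ones((num_points, 1))], axis=1
    )  # (P, 4)

    feature_sum = np.zeros((num_points, channels))
    hit_count = np.zeros((num_points, 1))
    for v in range(num_views):
        projected = homogeneous @ projections[v].T  # (P, 3)
        depth = projected[:, 2]
        in_front = depth > 1e-8
        # Avoid dividing by zero or negative depth for points behind the camera.
        safe_depth = np.where(in_front, depth, 1.0)
        cols = np.rint(projected[:, 0] / safe_depth).astype(int)
        rows = np.rint(projected[:, 1] / safe_depth).astype(int)
        visible = (
            in_front
            & (cols >= 0) & (cols < width)
            & (rows >= 0) & (rows < height)
        )
        # Nearest-pixel lookup (an assumption; bilinear sampling is also common).
        feature_sum[visible] += feature_maps[v, rows[visible], cols[visible]]
        hit_count[visible] += 1.0

    # Mean over the views that actually see each point.
    return feature_sum / np.maximum(hit_count, 1.0)
```

Averaging over views, rather than concatenating them, keeps the volume size independent of the number of input views, which is one plausible reason a method of this kind can accept outputs from different MVD models.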

Paper Structure (35 sections, 4 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Inconsistency with the training object increases as the viewpoint moves away from the reference image.
  • Figure 2: Method overview. The MVD model produces a set of images from different viewpoints based on a reference image. MVD$^2$ extracts and averages features from these images for each point in a coarse 3D grid $G$, and interpolates them into a finer grid $G'$, from which the surface mesh is extracted in a differentiable manner. The mesh reconstruction during training is supervised with pixelwise loss (red arrow) against depth/normal/mask maps at the reference view $v_0$, and with structural loss (yellow arrow) against normal maps at the other views. The reconstructed mesh can be textured by mapping to MVD images.
  • Figure 3: 3D reconstruction of Zero123++'s MVD images. The results of NeuS and MVD$^2$ are rendered in blue and cyan tones, respectively, from three different views.
  • Figure 4: Visualization of three examples reconstructed by different variants of MVD$^2$. The input MVD images are shown on the left.
  • Figure 5: Illustration of imperfect and failed reconstruction results. GT denotes the ground-truth 3D objects, shown for reference.
  • ...and 3 more figures
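
The Figure 2 caption says features on the coarse grid $G$ are interpolated into a finer grid $G'$. The caption does not say which interpolation is used, so the sketch below assumes plain trilinear interpolation with grid corners kept aligned; the function name `upsample_trilinear` and the integer `factor` parameter are hypothetical.

```python
import numpy as np


def upsample_trilinear(coarse, factor):
    """Trilinearly upsample vertex features of a coarse grid.

    Hypothetical helper, not taken from the paper.
    coarse: (D, H, W, C) features at grid vertices, each of D, H, W >= 2.
    factor: integer number of fine cells per coarse cell along each axis.
    Returns ((D-1)*factor+1, (H-1)*factor+1, (W-1)*factor+1, C).
    """
    out = np.asarray(coarse, dtype=float)
    # Trilinear interpolation is separable: interpolate linearly along
    # each of the three spatial axes in turn.
    for axis in range(3):
        n = out.shape[axis]
        positions = np.linspace(0.0, n - 1, (n - 1) * factor + 1)
        lower = np.minimum(np.floor(positions).astype(int), n - 2)
        weight = positions - lower
        shape = [1, 1, 1, 1]
        shape[axis] = -1
        weight = weight.reshape(shape)  # broadcast along the current axis
        low = np.take(out, lower, axis=axis)
        high = np.take(out, lower + 1, axis=axis)
        out = (1.0 - weight) * low + weight * high
    return out
```

Because every output value is a fixed weighted sum of coarse values, gradients flow back through this step unchanged, which is compatible with the differentiable mesh extraction the caption describes.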