MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei

TL;DR

MVBoost addresses the scarcity of high-quality 3D data for single-view reconstruction by fusing a high-accuracy multi-view diffusion model with a consistent 3D reconstruction model to generate pseudo-ground-truth and train a fast, feed-forward reconstructor. It introduces a multi-view refinement strategy to produce refined pseudo-views, and a LoRA-enhanced boosting reconstruction model trained on this data, coupled with input-view optimization to align the final asset with the user’s input view. The approach yields state-of-the-art results on Google Scanned Objects, with strong improvements in both 2D view quality and 3D geometry, and demonstrates generalization through open-world baselines and an OpenLRM boost. This framework enables scalable, diverse 3D asset generation from a single image without requiring large pre-existing 3D datasets, offering practical benefits for game content, AR/VR, and animation pipelines.

Abstract

Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets limits the generalization capabilities of 3D reconstruction models. In this paper, we propose MVBoost, a novel framework for boosting 3D reconstruction with multi-view refinement by generating pseudo-GT data. The key idea of MVBoost is to combine the high accuracy of the multi-view generation model with the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines the multi-view images rendered from this consistent 3D data to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, an input view optimization step optimizes the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.
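
The data-generation steps described in the abstract can be sketched as a simple loop. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `build_pseudo_gt`, `PseudoGTSample`, `generate_views`, `reconstruct`, `render`, and `refine` are hypothetical, the models are passed in as plain callables, and the abstract does not specify how the adaptive refinement is carried out, so `refine` is left as an opaque per-view function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List

# Hypothetical stand-ins: any object can play the role of an image or a 3D asset.
Image = Any
Asset3D = Any


@dataclass
class PseudoGTSample:
    """One training sample: the input image plus its refined pseudo-GT views."""
    input_view: Image
    refined_views: List[Image]


def build_pseudo_gt(
    images: Iterable[Image],
    generate_views: Callable[[Image], List[Image]],  # multi-view diffusion model (assumed interface)
    reconstruct: Callable[[List[Image]], Asset3D],   # large 3D reconstruction model (assumed interface)
    render: Callable[[Asset3D], List[Image]],        # renders views from the 3D data (assumed interface)
    refine: Callable[[Image], Image],                # refinement step, details not given in the abstract
) -> List[PseudoGTSample]:
    """Turn single-view images into a refined multi-view dataset."""
    dataset = []
    for image in images:
        views = generate_views(image)            # 1. generate multiple views from one image
        asset = reconstruct(views)               # 2. lift them to consistent 3D data
        rendered = render(asset)                 # 3. render views from the 3D data
        refined = [refine(v) for v in rendered]  # 4. refine each rendered view
        dataset.append(PseudoGTSample(image, refined))
    return dataset
```

The resulting samples would then supervise the feed-forward reconstruction model; passing the models in as callables keeps the sketch independent of any particular diffusion or reconstruction library.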

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Given a single image as input, our MVBoost can generate a high-quality 3D asset.
  • Figure 2: Overview of our MVBoost framework. Given a single-view image dataset, we first employ a multi-view diffusion model to generate the original multi-view dataset. The original multi-view images are then fed into a large 3D reconstruction model to produce a 3D Gaussian Splatting representation. Several views are rendered from this representation and refined by the diffusion model to produce the refined multi-view dataset. During training, the refined multi-view dataset is used to supervise the 3D reconstruction model with LoRA. Finally, the generated 3D assets are optimized to align with the specific input viewpoint, yielding high-fidelity reconstruction results.
  • Figure 3: Qualitative comparison of image-to-3D methods. Our approach demonstrates superior 3D generation across a range of challenging images.
  • Figure 4: Qualitative comparison of 2D multi-view data before and after refinement. Compared to the original views, our refined multi-view images show corrected geometry and higher-fidelity detail.
  • Figure 5: Our multi-view refinement strategy effectively corrects substantial viewpoint errors in the original views.
  • ...and 2 more figures
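
The Figure 2 caption mentions training the reconstruction model with LoRA. As a reminder of what a low-rank adapter does in general, here is a minimal pure-Python sketch: the frozen weight `W` is augmented by a trainable low-rank product `B @ A`. The helper names `matmul` and `lora_weight`, the `scale` parameter, and the tiny matrix sizes are assumptions for illustration only. Nothing here reflects which layers MVBoost adapts or which rank it uses, since this summary does not state those details.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A).

    w: frozen base weight, shape (out x in)
    b: trainable factor, shape (out x r)
    a: trainable factor, shape (r x in), with rank r much smaller than out and in
    """
    delta = matmul(b, a)  # low-rank update, shape (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# With B initialised to zeros, the adapted weight equals the frozen weight,
# so fine-tuning starts from the pretrained model's behaviour.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r = 1
B_zero = [[0.0], [0.0]]
adapted_at_init = lora_weight(W, A, B_zero)
```

Only `A` and `B` would be updated during fine-tuning, which keeps the number of trainable parameters small relative to the full weight matrix.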