Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang

TL;DR

Sparse3D addresses 3D reconstruction from extremely sparse views by distilling priors from a multiview-consistent diffusion model into NeRF, guided by an epipolar-controlled diffusion framework. It introduces a specialized Epipolar Controller and an enhanced feature renderer to produce multiview-consistent novel-view images, then employs Category-Score Distillation Sampling (C-SDS) to sharpen detail in NeRF optimization. Empirical results on CO3DV2 show state-of-the-art performance in both novel-view synthesis and geometry reconstruction, with strong generalization to unseen categories thanks to Stable Diffusion priors. The approach trades some additional runtime for significantly improved 3D fidelity and detail in sparse-view scenarios, enabling robust open-world object reconstruction.

Abstract

Reconstructing 3D objects from extremely sparse views is a long-standing and challenging problem. While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2, a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art works on metrics for NVS and geometry reconstruction.
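
For readers unfamiliar with the conventional SDS that the abstract contrasts against, the following is a minimal sketch of the vanilla (DreamFusion-style) SDS gradient with respect to a rendered image. It is not the paper's C-SDS, whose formulation is not given in this summary. The names `sds_grad` and `make_oracle_denoiser`, the weighting choice w(t) = 1 - alpha_bar_t, and the toy "oracle" denoiser standing in for a real diffusion model are all assumptions made for illustration.

```python
import numpy as np


def sds_grad(image, noise_pred_fn, alpha_bar_t, rng, weight=None):
    """Vanilla SDS gradient w.r.t. a rendered image (illustrative sketch).

    image         : rendered view, any array shape
    noise_pred_fn : hypothetical stand-in for a diffusion model's noise
                    predictor, called as noise_pred_fn(noisy_image, alpha_bar_t)
    alpha_bar_t   : cumulative noise-schedule product at the sampled timestep
    weight        : w(t); defaults to 1 - alpha_bar_t (one common choice,
                    assumed here, not taken from the paper)
    """
    # Forward-diffuse the rendering to the sampled noise level.
    eps = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar_t) * image + np.sqrt(1.0 - alpha_bar_t) * eps
    # Ask the (frozen) model which noise it believes was added.
    eps_hat = noise_pred_fn(noisy, alpha_bar_t)
    w = (1.0 - alpha_bar_t) if weight is None else weight
    # SDS skips the denoiser Jacobian: the gradient is the weighted residual.
    return w * (eps_hat - eps)


def make_oracle_denoiser(target):
    """Toy denoiser that assumes the clean image is always `target`."""
    def predict(noisy, alpha_bar_t):
        return (noisy - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)
    return predict


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.full((4, 4, 3), 0.5)
    image = np.zeros((4, 4, 3))
    denoiser = make_oracle_denoiser(target)
    # Plain gradient descent on the image pulls it toward the prior's target.
    for _ in range(200):
        image -= 0.1 * sds_grad(image, denoiser, 0.5, rng)
    print(float(np.abs(image - target).max()))
```

In an actual pipeline the gradient would be backpropagated through the renderer into the NeRF parameters rather than applied to pixels directly, and the noise predictor would be a pre-trained model such as Stable Diffusion.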

Paper Structure

This paper contains 54 sections, 10 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: Novel-view synthesis from two input views using our Sparse3D and SparseFusion. Our approach achieves higher-quality images with more detail for unseen instances, especially in their unobserved regions (e.g., the left face of the teddybear). Furthermore, our approach can generalize to some unseen categories without any further finetuning, while SparseFusion fails.
  • Figure 2: Overview of Sparse3D. Our approach consists of two key components: a multiview-consistent diffusion model and category-score distillation sampling. We use an epipolar feature map to control the Stable Diffusion model so that it generates images consistent with the content of the input images, serving as a multiview-consistent diffusion model. Based on this model, we propose a category-score distillation sampling (C-SDS) strategy to achieve more detailed results during NeRF reconstruction.
  • Figure 3: Multiview-consistent diffusion model. Our multiview-consistent diffusion model comprises a feature renderer, an epipolar controller, and a Stable Diffusion model.
  • Figure 4: Qualitative comparison of novel-view synthesis given 2 input views. Our approach produces higher-quality novel-view images with more detail than the other methods (e.g., the face of the teddybear), for both unseen instances and unseen categories.
  • Figure 5: Geometry reconstruction using SparseFusion and our method. The last column shows the ground-truth point cloud.
  • ...and 7 more figures