DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction

Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, Ulrich Neumann

TL;DR

DISN is a Deep Implicit Surface Network that generates a high-quality, detail-rich 3D mesh from a 2D image by predicting the underlying signed distance field. By combining global and local image features, it achieves state-of-the-art single-view reconstruction performance.

Abstract

Reconstructing 3D shapes from single-view images has been a long-standing research problem. In this paper, we present DISN, a Deep Implicit Surface Network that generates a high-quality, detail-rich 3D mesh from a 2D image by predicting the underlying signed distance field. In addition to utilizing global image features, DISN predicts the projected location of each 3D point on the 2D image and extracts local features from the image feature maps at that location. Combining global and local features significantly improves the accuracy of the signed distance field prediction, especially in detail-rich areas. To the best of our knowledge, DISN is the first method that consistently captures details such as holes and thin structures present in 3D shapes from single-view images. DISN achieves state-of-the-art single-view reconstruction performance on a variety of shape categories, reconstructed from both synthetic and real images. Code is available at https://github.com/xharlie/DISN. The supplementary material can be found at https://xharlie.github.io/images/neurips_2019_supp.pdf
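
The local feature step described in the abstract (project a 3D point onto the image, then read features at that location from each feature map) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the DISN code base: the pinhole camera model with intrinsics `K`, rotation `R`, and translation `t`, the use of bilinear interpolation, the nested-list feature map layout, and the function names `project_point`, `bilinear_sample`, and `local_features`.

```python
import math

def project_point(p, K, R, t):
    """Project a 3D point p to pixel coordinates (u, v) with a pinhole camera.

    K (3x3 intrinsics), R (3x3 rotation), and t (translation) are hypothetical
    inputs; the point is assumed to lie in front of the camera.
    """
    cam = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    uvw = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate fmap[row][col] (a list of channels) at column x, row y."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the map so points projecting outside the image use border values.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ax) * (1 - ay) * fmap[y0][x0][k]
        + ax * (1 - ay) * fmap[y0][x1][k]
        + (1 - ax) * ay * fmap[y1][x0][k]
        + ax * ay * fmap[y1][x1][k]
        for k in range(len(fmap[0][0]))
    ]

def local_features(p, K, R, t, feature_maps, image_size):
    """Concatenate the features sampled at the projection of p on every map.

    image_size is (width, height) in pixels, both assumed to be > 1.
    """
    u, v = project_point(p, K, R, t)
    feats = []
    for fmap in feature_maps:
        # Rescale image pixel coordinates to this feature map's resolution.
        sx = (len(fmap[0]) - 1) / (image_size[0] - 1)
        sy = (len(fmap) - 1) / (image_size[1] - 1)
        feats.extend(bilinear_sample(fmap, u * sx, v * sy))
    return feats
```

In this sketch, the concatenated vector returned by `local_features` plays the role of the per-point local features; a real encoder would supply feature maps with many more channels and spatial positions.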

Paper Structure

This paper contains 26 sections, 3 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Single-view reconstruction results using OccNet (Mescheder et al., CVPR 2019) and DISN on synthetic and real images.
  • Figure 2: Illustration of SDF. (a) Rendered 3D surface with $s = 0$. (b) Cross-section of the SDF. A point is outside the surface if $s > 0$, inside if $s < 0$, and on the surface if $s = 0$.
  • Figure 3: Local feature extraction. Given a 3D point $\mathbf{p}$, we use the estimated camera parameters to project $\mathbf{p}$ onto the image plane. Then we identify the projected location on each feature map layer of the encoder. We concatenate features at each layer to get the local features of point $\mathbf{p}$.
  • Figure 4: Given an image and a point $\mathbf{p}$, we estimate the camera pose and project $\mathbf{p}$ onto the image plane. DISN uses the local features at the projected location, the global features, and the point features to predict the SDF of $\mathbf{p}$. 'MLPs' denotes multi-layer perceptrons.
  • Figure 5: Camera Pose Estimation Network. 'PC' denotes point cloud. 'GT Cam' and 'Pred Cam' denote the ground truth and predicted cameras.
  • ...and 7 more figures
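
The sign convention in the Figure 2 caption can be made concrete with a small example. The sketch below uses the analytic signed distance function of a sphere, chosen only because it has a closed form; DISN itself predicts SDF values with a network. The names `sphere_sdf` and `classify` and the tolerance `eps` are hypothetical.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point p to a sphere: distance to center minus radius."""
    return math.dist(p, center) - radius

def classify(s, eps=1e-9):
    """Apply the Figure 2 convention: s > 0 outside, s < 0 inside, s = 0 on the surface."""
    if s > eps:
        return "outside"
    if s < -eps:
        return "inside"
    return "surface"

for point in [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]:
    s = sphere_sdf(point)
    print(point, s, classify(s))
```

The surface itself is the zero level set $s = 0$, which is what a mesh extraction step would recover from a grid of predicted SDF values.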