GenS: Generalizable Neural Surface Reconstruction from Multi-View Images

Rui Peng, Xiaodong Gu, Luyang Tang, Shihe Shen, Fanqi Yu, Ronggang Wang

TL;DR

GenS tackles the problem of generalizable neural surface reconstruction from multi-view images by introducing a generalized multi-scale volume that encodes multiple scenes in a single model. It leverages a discriminative multi-scale feature-metric consistency in place of photometric consistency and introduces a view contrast loss to transfer dense-input priors to sparse-input reconstructions, enabling robust geometry under limited viewpoints. The method jointly predicts SDF and color via a multi-scale fusion strategy and uses SDF-based volume rendering to recover surfaces, achieving state-of-the-art generalization on DTU and BlendedMVS, with fast inference and efficient fine-tuning. Overall, GenS offers a practical, end-to-end framework for high-fidelity, scalable 3D reconstruction across diverse scenes.

Abstract

Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods require lengthy per-scene optimization and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful: it can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency that imposes multi-view consistency in a more discriminative multi-scale feature space, which is robust to failures of photometric consistency. The learnable features can also be self-enhanced to continuously improve matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss that makes the model robust to regions covered by few viewpoints by distilling the geometric prior from dense input to sparse input. Extensive experiments on popular benchmarks show that our model generalizes well to new scenes and outperforms existing state-of-the-art methods, even those employing ground-truth depth supervision. Code is available at https://github.com/prstrive/GenS.
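
To make the "SDF plus differentiable volume rendering" paradigm mentioned in the abstract concrete, here is a minimal sketch of how SDF samples along one ray can be turned into rendering weights. It assumes a NeuS-style discrete opacity built from a logistic CDF; the abstract does not state which formulation GenS uses, and the names `sdf_to_weights`, `render_color`, and `inv_s` are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sdf_to_weights(sdf_values, inv_s=64.0):
    """Convert SDF samples along one ray into per-interval rendering weights.

    Assumption (not taken from the source): NeuS-style opacity
        alpha_i = max((Phi(f_i) - Phi(f_{i+1})) / Phi(f_i), 0),
    where Phi is a logistic CDF with sharpness inv_s.
    """
    weights = []
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for f_cur, f_next in zip(sdf_values[:-1], sdf_values[1:]):
        cdf_cur = sigmoid(inv_s * f_cur)
        cdf_next = sigmoid(inv_s * f_next)
        # Opacity is positive only where the SDF decreases (ray enters the surface).
        alpha = max((cdf_cur - cdf_next) / (cdf_cur + 1e-8), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_color(sdf_values, colors, inv_s=64.0):
    """Blend one RGB color per interval using the weights above."""
    weights = sdf_to_weights(sdf_values, inv_s)
    return tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))
```

With samples that cross from positive to negative SDF, almost all of the weight falls on the interval containing the zero crossing, which is why the rendered color is dominated by the surface point.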

Paper Structure (26 sections, 18 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Qualitative comparisons on DTU and BlendedMVS datasets with sparse inputs.
  • Figure 2: Illustration of GenS. We first extract multi-scale features with an FPN. The generalized multi-scale volume is then reconstructed from the features at the corresponding scale. We employ the same blending strategy as IBRNet (wang2021ibrnet) to estimate the appearance of each point on a ray, and adopt volume rendering to recover the color of a pixel. We design the multi-scale feature-metric consistency to constrain the geometry, as shown in the top right. For convenience, we omit some losses that will be detailed later.
  • Figure 3: Multi-view aggregation ambiguity, illustrated with two viewpoints. (a) In low-texture regions, sampling points near the surface may receive the same aggregated feature and thus lack discriminability. (b) The aggregated features of points away from the surface are effectively random, which makes it hard to infer accurate geometry; e.g., two sampling points may receive the same aggregation even though their SDF values differ.
  • Figure 4: Multi-scale feature space. The feature space is more discriminative than the ordinary image space, so the corresponding point is more likely to be found during matching.
  • Figure 5: Locating the surface of a ray.
  • ...and 4 more figures
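
As a rough illustration of what "locating the surface of a ray" (Figure 5) can mean in practice, the sketch below finds the first outside-to-inside sign change of the SDF along a ray and linearly interpolates the depth of the zero crossing. This is a generic technique written under that assumption, not necessarily the procedure used in the paper; `locate_surface` and its arguments are hypothetical names.

```python
def locate_surface(depths, sdf_values):
    """Return the interpolated depth of the first outside-to-inside zero crossing.

    depths      -- increasing sample depths along the ray
    sdf_values  -- SDF evaluated at each sample (positive outside the surface)
    Returns None when the ray never crosses the surface.
    """
    for i in range(len(depths) - 1):
        f0, f1 = sdf_values[i], sdf_values[i + 1]
        if f0 > 0.0 and f1 <= 0.0:
            # Linear interpolation: fraction of the interval where the SDF hits zero.
            t = f0 / (f0 - f1)
            return depths[i] + t * (depths[i + 1] - depths[i])
    return None


# Usage: the SDF changes sign between depths 1.5 and 2.0.
surface_depth = locate_surface([1.0, 1.5, 2.0, 2.5], [0.4, 0.1, -0.1, -0.3])
```

Once the surface point is known, a consistency term such as the multi-scale feature-metric loss can be evaluated at that point by projecting it into the source views.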