VF-NeRF: Learning Neural Vector Fields for Indoor Scene Reconstruction

Albert Gassol Puigjaner, Edoardo Mello Rella, Erik Sandström, Ajad Chhatkuli, Luc Van Gool

TL;DR

VF-NeRF presents a neural implicit representation for indoor scene reconstruction that learns a Vector Field (VF) pointing toward the nearest surface. The method transforms the VF into a differentiable surface density and combines it with volume rendering in a dual-MLP architecture to recover geometry and appearance from multi-view images. A hierarchical ray sampling strategy and sliding-window density smoothing enable efficient, accurate reconstruction of large planar regions and sharp corners, and depth cues further improve geometry accuracy. Experiments on Replica and ScanNet show state-of-the-art 3D reconstruction metrics and competitive novel-view synthesis, supporting VF-NeRF’s strong inductive bias toward planar indoor structures. The approach handles low-texture areas while preserving high-frequency details in rendered views.

Abstract

Implicit surfaces via neural radiance fields (NeRF) have shown surprising accuracy in surface reconstruction. Despite their success in reconstructing richly textured surfaces, existing methods struggle with planar regions with weak textures, which account for the majority of indoor scenes. In this paper, we address indoor dense surface reconstruction by revisiting key aspects of NeRF in order to use the recently proposed Vector Field (VF) as the implicit representation. VF is defined as the unit vector directed to the nearest surface point. It therefore flips direction at the surface, where it equals the explicit surface normal. Except for this flip, VF remains constant along planar surfaces and provides a strong inductive bias in representing planar surfaces. Concretely, we develop a novel density-VF relationship and a training scheme that allows us to learn VF via volume rendering. By doing this, VF-NeRF can model large planar surfaces and sharp corners accurately. We show that, when depth cues are available, our method further improves and achieves state-of-the-art results in reconstructing indoor scenes and rendering novel views. We extensively evaluate VF-NeRF on indoor datasets and run ablations of its components.
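
To make the planar inductive bias concrete, the following minimal sketch evaluates the VF of a single infinite plane analytically in NumPy. The helper name `plane_vector_field` and the example plane and points are illustrative assumptions; they are not taken from the paper, where the VF is predicted by a network.

```python
import numpy as np

def plane_vector_field(points, normal, offset):
    """Unit vector from each point toward the nearest point on the plane n.x = offset.

    Hypothetical helper for illustration only. Points lying exactly on the
    plane have no defined direction and come back as zero vectors.
    """
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Signed distance: positive on the side the normal points to.
    signed_dist = np.asarray(points, dtype=float) @ n - offset
    # The nearest surface point lies along -n on the positive side, +n on the other.
    return -np.sign(signed_dist)[:, None] * n

# Floor plane z = 0, sampled above and below it at different horizontal positions.
points = np.array([[0.0, 0.0, 2.0], [5.0, -3.0, 0.1], [1.0, 1.0, -0.5]])
vf = plane_vector_field(points, normal=[0.0, 0.0, 1.0], offset=0.0)
```

The two points above the floor receive the same vector regardless of their position or height, and the point below the floor receives the opposite vector. A network fitting this field over a planar region therefore only has to represent a constant plus a sign flip at the surface.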

Paper Structure

This paper contains 18 sections, 14 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: VF-NeRF overview. We use VF to represent the geometry of a scene. Specifically, given an input image taken from the camera view position, we shoot a batch of rays onto the 3D scene. We predict the VF and color of the points along the ray using geometry and color decoders. By computing the cosine similarity between neighboring points on the ray, we can identify the surface as the locations where the value equals $-1$, i.e. when the two predicted vectors have opposing directions. From the cosine similarity, we differentiably compute the surface density. We render the RGB and depth in order to compute the re-rendering losses.
  • Figure 2: Density using non-averaged and averaged cosine similarity. The figures show the VF, cosine similarity and density of a ray crossing a surface. Top: density as a transformation of the cosine similarity. This yields a sharp function similar to the delta function centered at the surface. Bottom: Density as a transformation of the weighted average cosine similarity. This produces a smoother function with the maximum centered at the surface.
  • Figure 3: Sliding window weights annealing example and hierarchical sampling. Top: Example of $M=6$ weights at different stages of the annealing. At the beginning of the training (epoch 0), the weights for each neighbor are equal. At the end of the training (epoch 100) the cosine similarity is computed only with respect to the closest next neighbor. Bottom: Initially, we sample uniform points along the ray and compute the surface density through the predicted VF. We then densely sample points within a range $d_{samples}$ centered at the maximum of the surface density.
  • Figure 4: 3D reconstruction qualitative results. VF-NeRF outperforms the SOTA in planar regions of the scenes such as walls and floors as well as in several details. We highlight regions where VF-NeRF outperforms the other methods with yellow boxes.
  • Figure 5: Ablations. Removing the hierarchical sampling generates holes and artifacts (see table in the meshes). Our method without $\mathcal{L}_{depth}$ is less accurate, as most regions of the scene are low-textured. Nonetheless, it still captures the overall scene coarse geometry.
  • ...and 5 more figures
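
The surface-localisation steps described in the captions of Figures 1 to 3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the VF is computed analytically for a wall perpendicular to the ray instead of being predicted by the geometry decoder, the function and variable names are hypothetical, and the transformation from cosine similarity to density is not given in this summary, so the sketch assumes only that density decreases monotonically with cosine similarity and localises the surface at the similarity minimum.

```python
import numpy as np

def neighbor_cosine(vf, weights):
    """Weighted average cosine similarity between each sample and its next M neighbors.

    vf: (N, 3) unit vectors at ordered samples along a ray.
    weights: (M,) non-negative weights summing to 1; weights[k] applies to the
    (k+1)-th next neighbor. Returns an array of length N - M.
    """
    m = len(weights)
    n_out = len(vf) - m
    sims = np.stack(
        [np.sum(vf[:n_out] * vf[k + 1:k + 1 + n_out], axis=1) for k in range(m)],
        axis=1,
    )
    return sims @ np.asarray(weights, dtype=float)

# Toy ray hitting a wall head-on at depth 1.23 (illustrative value).
direction = np.array([0.0, 0.0, -1.0])
surface_t = 1.23
t_coarse = np.linspace(0.0, 2.0, 21)  # uniform coarse samples along the ray

# Analytic stand-in for the geometry decoder: the VF points along the ray
# in front of the wall and against it behind the wall.
vf = np.where(t_coarse[:, None] < surface_t, direction, -direction)

M = 6
one_hot = np.eye(M)[0]         # end of annealing: closest next neighbor only
uniform = np.full(M, 1.0 / M)  # start of annealing: equal weights

sharp = neighbor_cosine(vf, one_hot)
smooth = neighbor_cosine(vf, uniform)

# Assumption: the density maximum coincides with the cosine-similarity minimum.
i_star = int(np.argmin(smooth))
center = 0.5 * (t_coarse[i_star] + t_coarse[i_star + 1])

# Dense resampling within a range d_samples centered at the estimated surface.
d_samples = 0.2  # illustrative value
t_fine = np.linspace(center - d_samples / 2, center + d_samples / 2, 32)
```

With the one-hot weights the similarity drops to $-1$ for a single sample pair, whereas the uniform weights spread the drop over several preceding samples, which mirrors the sharp and smoothed profiles described for Figure 2. The dense samples in `t_fine` then bracket the true wall depth.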