Hi-Map: Hierarchical Factorized Radiance Field for High-Fidelity Monocular Dense Mapping

Tongyan Hua, Haotian Bai, Zidong Cao, Ming Liu, Dacheng Tao, Lin Wang

TL;DR

Hi-Map addresses monocular dense mapping without depth priors by introducing a hierarchical factorized grid representation and a dual-path encoding strategy. It decouples geometry and appearance, applies SDF-based proxy rendering for stable density estimation, and performs online optimization within a sliding window to achieve real-time performance. The method demonstrates superior geometric and textural fidelity on the Replica dataset and shows robustness in textureless regions, outperforming state-of-the-art monocular NeRF-based methods. This approach reduces memory and computation while maintaining high-quality reconstructions, enabling practical dense mapping in depth-scarce scenarios.

Abstract

In this paper, we introduce Hi-Map, a novel monocular dense mapping approach based on Neural Radiance Field (NeRF). Hi-Map is exceptional in its capacity to achieve efficient and high-fidelity mapping using only posed RGB inputs. Our method eliminates the need for external depth priors derived from, e.g., a depth estimation model. Our key idea is to represent the scene as a hierarchical feature grid that encodes the radiance and then factorize it into feature planes and vectors. As such, the scene representation becomes simpler and more generalizable, enabling fast and smooth convergence on new observations. This allows for efficient computation while alleviating noise patterns by reducing the complexity of the scene representation. Buttressed by the hierarchical factorized representation, we leverage the Signed Distance Field (SDF) as a rendering proxy for inferring the volume density, demonstrating high mapping fidelity. Moreover, we introduce a dual-path encoding strategy to strengthen the photometric cues and further boost the mapping quality, especially for distant and textureless regions. Extensive experiments demonstrate our method's superiority in geometric and textural accuracy over state-of-the-art NeRF-based monocular mapping methods.
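
The factorization of a feature grid into planes and vectors can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that each of the three components is the product of a bilinearly interpolated plane and a linearly interpolated vector, that the three components are summed, and that features from several resolution levels are concatenated. All function names, resolutions, and channel counts are hypothetical.

```python
import numpy as np


def lerp_vector(vec, t):
    """Linearly interpolate a (R, C) feature vector at t in [0, 1]."""
    r = vec.shape[0]
    u = float(np.clip(t, 0.0, 1.0)) * (r - 1)
    i0 = int(np.floor(u))
    i1 = min(i0 + 1, r - 1)
    w = u - i0
    return (1.0 - w) * vec[i0] + w * vec[i1]


def bilerp_plane(plane, a, b):
    """Bilinearly interpolate a (R, R, C) feature plane at (a, b) in [0, 1]^2."""
    r = plane.shape[0]
    ua = float(np.clip(a, 0.0, 1.0)) * (r - 1)
    ub = float(np.clip(b, 0.0, 1.0)) * (r - 1)
    a0, b0 = int(np.floor(ua)), int(np.floor(ub))
    a1, b1 = min(a0 + 1, r - 1), min(b0 + 1, r - 1)
    wa, wb = ua - a0, ub - b0
    low = (1.0 - wb) * plane[a0, b0] + wb * plane[a0, b1]
    high = (1.0 - wb) * plane[a1, b0] + wb * plane[a1, b1]
    return (1.0 - wa) * low + wa * high


def make_level(res, channels, rng):
    """One resolution level: three planes and three vectors (hypothetical layout)."""
    level = {}
    for plane_name, vec_name in (("M_xy", "v_z"), ("M_yz", "v_x"), ("M_xz", "v_y")):
        level[plane_name] = rng.normal(size=(res, res, channels))
        level[vec_name] = rng.normal(size=(res, channels))
    return level


def query_level(level, p):
    """Sum of three plane-times-vector components at point p = (x, y, z)."""
    x, y, z = p
    f_z = bilerp_plane(level["M_xy"], x, y) * lerp_vector(level["v_z"], z)
    f_x = bilerp_plane(level["M_yz"], y, z) * lerp_vector(level["v_x"], x)
    f_y = bilerp_plane(level["M_xz"], x, z) * lerp_vector(level["v_y"], y)
    return f_x + f_y + f_z


def query_hierarchy(levels, p):
    """Concatenate per-level features (assumed way of combining levels)."""
    return np.concatenate([query_level(level, p) for level in levels])


def param_counts(res, channels):
    """Parameters of a dense grid versus the factorized form at one level."""
    dense = res ** 3 * channels
    factorized = 3 * (res ** 2 + res) * channels
    return dense, factorized
```

Under these assumptions, a level with resolution 128 and 4 channels stores 198,144 values in factorized form, compared with 8,388,608 for a dense grid of the same resolution, which is the kind of saving that makes the representation cheaper to optimize online.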

Paper Structure (12 sections, 10 equations, 10 figures, 1 table)

This paper contains 12 sections, 10 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Our Hi-Map delivers higher mapping fidelity than existing state-of-the-art methods (e.g., GO-SLAM [goslam23]) with monocular observations, even without the use of geometric priors derived from rigorous global optimization of external tracking systems.
  • Figure 2: Illustration of the factorization scheme of a feature grid. For a point $p$ with coordinates $(x,y,z)$, dense feature grid encoding assigns its value by trilinear interpolation over the 8 vertices of the enclosing voxel. With factorization, the value of $p$ is instead estimated by summing 3 components ($F^x, F^y, F^z$) into $F^{xyz}(p)$. An example is given for the value interpolation on component $F^z$, which includes the matrix component $M^{xy}$ and the vector component $v^{z}$.
  • Figure 3: The proposed pipeline of our Hi-Map. Given a posed RGB frame $T_t$, the sampled coordinate $p_i$ is encoded by the Multi-resolution Factorized Feature Grid $F_l$ for appearance $\mathcal{F}_{app}$ and geometry $\mathcal{F}_{geo}$, which are decoded by $\Phi_{app}$ and $\Phi_{geo}$ into color ($c_i$) and SDF ($s_i$), respectively, through Dual-Path Decoding. Volume rendering is performed using the proxy function $\boldsymbol{P(\cdot)}$, which transforms the SDF into density ($\sigma_i$), enabling continuous learning of the neural implicit map from the observations in a sliding window at each timestep $t$.
  • Figure 4: Impact of geometric representations on volume rendering. Hi-Map uses the SDF (density) representation, in which the SDF is first transformed into a density. This leads to a smoother gradient of weights than SDF (direct), where the SDF is transformed directly into a weighting factor. SDF (density) also converges faster than occupancy.
  • Figure 5: Comparison of final reconstructions on the Replica dataset. The blind spot regions are delineated with red (GO-SLAM) and green (Hi-Map) boxes, respectively, and corresponding visualizations are provided from observable viewpoints. Our approach achieves higher scene fidelity and exhibits stronger expressive capability for indoor vertical planes.
  • ...and 5 more figures
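
The SDF-to-density route described in the captions of Figures 3 and 4 can be sketched as follows. The exact form of the proxy function $P(\cdot)$ is not given in this summary, so the sketch substitutes a Laplace-CDF mapping in the style of VolSDF as a stand-in. The names `sdf_to_density` and `render_weights`, the parameter `beta`, and the sample spacing are assumptions made for illustration only.

```python
import numpy as np


def sdf_to_density(sdf, beta=0.05):
    """Stand-in proxy: density = (1 / beta) * LaplaceCDF(-sdf; scale=beta).

    Assumption: SDF is positive outside the surface and negative inside.
    This is a VolSDF-style mapping, not necessarily the paper's P(.).
    """
    sdf = np.asarray(sdf, dtype=float)
    # The exponent is always <= 0, so this cannot overflow.
    half = 0.5 * np.exp(-np.abs(sdf) / beta)
    return (1.0 / beta) * np.where(sdf >= 0.0, half, 1.0 - half)


def render_weights(density, t_vals):
    """Standard volume rendering weights w_i = T_i * alpha_i along one ray."""
    density = np.asarray(density, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])  # reuse the last interval length
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance before sample i: product of (1 - alpha_j) for j < i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha


# Toy ray that hits a planar surface at t = 1.0 (hypothetical scene).
t_vals = np.linspace(0.0, 2.0, 201)
sdf = 1.0 - t_vals
weights = render_weights(sdf_to_density(sdf), t_vals)
depth = float(np.sum(weights * t_vals))
```

In this toy setup the weights concentrate around the zero crossing of the SDF, so the weighted sum `depth` lands close to the true surface distance. A smaller `beta` sharpens the weight distribution around the surface, at the cost of steeper gradients.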