3QFP: Efficient neural implicit surface reconstruction using Tri-Quadtrees and Fourier feature Positional encoding

Shuo Sun, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson

TL;DR

This work introduces a sparse structure, tri-quadtrees, which represents the environment using learnable features stored in three planar quadtree projections. The learnable features are concatenated with a Fourier feature positional encoding, which yields smoother reconstructions with a higher completion ratio and fewer holes.

Abstract

Neural implicit surface representations are currently receiving a lot of interest as a means to achieve high-fidelity surface reconstruction at a low memory cost, compared to traditional explicit representations. However, state-of-the-art methods still struggle with excessive memory usage and non-smooth surfaces. This is particularly problematic in large-scale applications with sparse inputs, as is common in robotics use cases. To address these issues, we first introduce a sparse structure, tri-quadtrees, which represents the environment using learnable features stored in three planar quadtree projections. Secondly, we concatenate the learnable features with a Fourier feature positional encoding. The combined features are then decoded into signed distance values through a small multi-layer perceptron. We demonstrate that this approach facilitates smoother reconstruction with a higher completion ratio and fewer holes. Compared to two recent baselines, one implicit and one explicit, our approach requires only 10% to 50% as much memory, while achieving competitive quality.
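
As a rough illustration of the concatenation step described in the abstract, the sketch below builds a Fourier feature positional encoding for a 3D point and appends it to a learned feature vector. The frequency schedule (2^k times pi, as popularized by NeRF-style encodings) is an assumption and is not stated in this text; the function names `fourier_encoding` and `combine_features` are hypothetical. With m frequencies per coordinate, the encoding has 6m entries, which matches the length quoted in the Figure 2 caption.

```python
import math


def fourier_encoding(p, m):
    """Return [sin(f_k * c), cos(f_k * c)] for each coordinate c of p and k < m.

    ASSUMPTION: f_k = 2**k * pi. The actual frequencies used by the authors
    are not given in this text.
    """
    features = []
    for coord in p:
        for k in range(m):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


def combine_features(learned, p, m):
    """Concatenate a learned feature vector with the encoding of point p."""
    return list(learned) + fourier_encoding(p, m)


if __name__ == "__main__":
    # A learned feature of length 8 plus an encoding with m = 4 (6 * 4 = 24).
    combined = combine_features([0.0] * 8, (0.1, 0.2, 0.3), m=4)
    print(len(combined))
```

The combined vector is what a small MLP would take as input to predict a signed distance value; the MLP itself is omitted here because its architecture is not described in this text.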

Paper Structure (18 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Qualitative reconstruction result on $\texttt{KITTI-Seq07}$. Our method achieves better reconstruction quality using less memory than SHINE-Mapping [zhong2023shine] and VDBFusion [vizzo2022vdbfusion]. When given noisy and sparse lidar scans, our method achieves a more complete reconstruction (see the red circle and the black zoomed-in square areas).
  • Figure 2: Overview of our method. We represent the scene with three planar quadtrees $\mathcal{M}_{i}^{\ell}$, where $i \in \{XZ,YZ,XY\}$ and $\ell$ denotes the quadtree depth. We store features in the deepest $H$ levels of the quadtrees. When querying a point $\bm{p}$, we project it onto each planar quadtree to identify the node containing $\bm{p}$ at level $\ell$. The feature of $\bm{p}$ is then computed by bilinear interpolation of the node's vertex features at the queried location. Features at the same level are added, and the results are concatenated across levels. Concatenated with the positional encoding $\gamma(\bm{p})$, the feature of $\bm{p}$, $\Phi(\bm{p})$, is fed into a small MLP ($\mathcal{F}_\Theta$) to predict the SDF value. The learnable features stored in the quadtree nodes and the network parameters are learned by test-time optimization using the loss function $\mathcal{L}_{\text{bce}}$. The learnable feature vectors have length $d$ and the positional encoding feature vector has length $6m$.
  • Figure 3: Comparison of completion ratio [%] versus the input frame interval $n_{s}$ on two datasets. The threshold is 0.1 m for MaiCity and 0.2 m for NewerCollege. As the inputs get sparser, the completion ratio of VDBFusion drops significantly, while our method maintains a high completion ratio. Although its performance is similar to that of SHINE-Mapping, our method uses fewer parameters (see Figure 5).
  • Figure 4: Qualitative visualization of the map quality on the $\texttt{MaiCity}$ dataset using every 6th frame. The first row depicts the difference between the dense ground truth point cloud and the reconstructed mesh; the ground truth points with an error of more than 0.1 m are highlighted in orange. The second row shows zoomed-in images of the dashed areas (indicated in the top-right image). When inputs are sparse (e.g., every 6th frame in this case), our method obtains visibly smoother results.
  • Figure 5: Number of learnable parameters versus subsampling frequency on the two datasets. The subsampling frequency is given as $n_{s}$, the number of frames after which another frame is selected. Our method needs only about 25% and 10% of the parameters of SHINE-Mapping on the MaiCity and NewerCollege datasets, respectively.
  • ...and 2 more figures
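
The feature query described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: dense per-level grids stand in for the sparse quadtrees, points are assumed to lie in the unit cube, and all names (`PLANES`, `bilinear`, `query_features`, `make_linear_grid`) are hypothetical. For each level, the point is projected onto the XZ, YZ and XY planes, the vertex features are bilinearly interpolated, the three plane features are added, and the per-level sums are concatenated.

```python
import math

# Indices of the two coordinates kept by each planar projection.
PLANES = {"XZ": (0, 2), "YZ": (1, 2), "XY": (0, 1)}


def bilinear(grid, u, v):
    """Bilinearly interpolate feature vectors stored at integer grid vertices.

    grid[i][j] is the feature vector at vertex (i, j); (u, v) are continuous
    coordinates in grid units, clamped to the grid extent.
    """
    i0 = min(max(int(math.floor(u)), 0), len(grid) - 2)
    j0 = min(max(int(math.floor(v)), 0), len(grid[0]) - 2)
    du, dv = u - i0, v - j0
    corners = [
        (grid[i0][j0], (1 - du) * (1 - dv)),
        (grid[i0 + 1][j0], du * (1 - dv)),
        (grid[i0][j0 + 1], (1 - du) * dv),
        (grid[i0 + 1][j0 + 1], du * dv),
    ]
    dim = len(grid[0][0])
    return [sum(w * f[c] for f, w in corners) for c in range(dim)]


def query_features(levels, p):
    """Sum the three plane features per level, then concatenate across levels.

    levels: list of dicts mapping a plane name to a dense vertex grid.
    ASSUMPTION: p lies in [0, 1]^3 and dense grids replace sparse quadtrees.
    """
    out = []
    for planes in levels:
        summed = None
        for name, (a, b) in PLANES.items():
            grid = planes[name]
            u = p[a] * (len(grid) - 1)
            v = p[b] * (len(grid[0]) - 1)
            feat = bilinear(grid, u, v)
            summed = feat if summed is None else [s + f for s, f in zip(summed, feat)]
        out.extend(summed)
    return out


def make_linear_grid(res):
    """Demo grid whose vertex (i, j) stores the feature [i / res, j / res]."""
    return [[[i / res, j / res] for j in range(res + 1)] for i in range(res + 1)]


if __name__ == "__main__":
    # Two levels (resolutions 4 and 8), feature length d = 2 per level.
    levels = [{name: make_linear_grid(r) for name in PLANES} for r in (4, 8)]
    print(query_features(levels, (0.25, 0.5, 0.75)))
```

In a trainable version the grid entries would be optimized parameters, and the returned vector would be concatenated with the positional encoding $\gamma(\bm{p})$ before being passed to the MLP.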