Hierarchical Pose Estimation and Mapping with Multi-Scale Neural Feature Fields

Evgenii Kruzhkov, Alena Savinykh, Sven Behnke

TL;DR

This work addresses large-scale SLAM with unknown poses by introducing a probabilistic implicit-mapping framework built on octree-based neural fields and hierarchical pose optimization. It learns a signed distance function map from sequential LiDAR data, using coarse-to-fine activation of multi-level MLPs to capture both coarse geometry and fine details, while propagating pose gradients through the measurements. The approach demonstrates strong localization on KITTI and superior mapping performance on MaiCity compared to sequential baselines, without requiring ground-truth poses. With real-time-like performance on standard GPUs, the method offers a practical, scalable solution for open-world robotic perception and navigation using neural implicit representations.

Abstract

Robotic applications require a comprehensive understanding of the scene. In recent years, neural fields-based approaches that parameterize the entire environment have become popular. These approaches are promising due to their continuous nature and their ability to learn scene priors. However, the use of neural fields in robotics becomes challenging when dealing with unknown sensor poses and sequential measurements. This paper focuses on the problem of sensor pose estimation for large-scale neural implicit SLAM. We investigate implicit mapping from a probabilistic perspective and propose hierarchical pose estimation with a corresponding neural network architecture. Our method is well-suited for large-scale implicit map representations. The proposed approach operates on consecutive outdoor LiDAR scans and achieves accurate pose estimation, while maintaining stable mapping quality for both short and long trajectories. We built our method on a structured and sparse implicit representation suitable for large-scale reconstruction and evaluated it using the KITTI and MaiCity datasets. Our approach outperforms the baseline in terms of mapping with unknown poses and achieves state-of-the-art localization accuracy.

Paper Structure (11 sections, 6 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Demonstration of the flexibility of the proposed implicit mapping, extended to the semantic domain. We create the map using sequential LiDAR measurements and corresponding semantic labels from the SemanticKITTI dataset [behley2021ijrr]. Our approach effectively learns a 3D semantic representation from the data, demonstrating its generalization abilities. We employ the Marching Cubes algorithm [lorensen1998marching] to visualize the learned information.
  • Figure 2: Overview of the proposed approach. Our method utilizes neural fields consisting of learnable $F$-dimensional features stored in the corners of an octree-based structure. Each trainable octree level is associated with tiny MLP networks. During the forward pass, the LiDAR measurements are transformed to world coordinates using an initial transformation $T_{i-1}$. The features of the voxel corners are then weighted based on the relative position of the sampled point $x_i$ (red) from the measurements. These weighted features are concatenated and fed to the corresponding MLP. The predictions of all levels are accumulated to generate the final occupancy probability $\bar{y}_i$ for the sampled points. During the backward pass, the gradient values are backpropagated to optimize the transformation $T_{i-1}$ toward $T_i$ in two ways: directly through the measurements $y_i$ and hierarchically through $\bar{y}_i$ (see the pose optimization section).
  • Figure 3: The coarse and final representations learned by our proposed approach, reconstructed using the Marching Cubes algorithm. (a) shows the representation learned solely by the coarse level MLP, while (b) displays the final representation that includes high-frequency features from the finer levels.
  • Figure 4: Estimated trajectories of the proposed approach on the KITTI dataset. The color bar shows the distance between corresponding poses of the estimated and ground-truth trajectories.
  • Figure 5: Revisiting a previously mapped area. The map is initialized during the first visit to the region (green). During the second traversal (blue), the estimated path successfully converges to the path taken during the first visit. Both traversals have the same direction of motion.
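
The forward pass described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a dense grid per level in place of the sparse octree, trilinear weights for the voxel corners, a one-hidden-layer ReLU MLP per level, and a sigmoid over the summed level outputs to obtain a probability. The names `Level`, `trilinear_weights`, `to_world`, `occupancy`, and the `active_levels` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def trilinear_weights(u):
    """Return the 8 corner offsets of a voxel and their weights for local coords u in [0, 1]^3."""
    corners = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    # Per axis: weight u for the far corner (offset 1), 1 - u for the near corner (offset 0).
    weights = np.where(corners == 1, u, 1.0 - u).prod(axis=1)
    return corners, weights

class Level:
    """One feature level: corner features plus a tiny MLP (dense grid assumed, not an octree)."""

    def __init__(self, resolution, voxel_size, feat_dim, hidden, rng):
        self.resolution = resolution
        self.voxel_size = voxel_size
        # One F-dimensional feature per grid corner: (R + 1)^3 corners in total.
        self.features = rng.normal(0.0, 0.1, size=(resolution + 1,) * 3 + (feat_dim,))
        # Tiny MLP: the 8 weighted corner features are concatenated into an 8F input.
        self.w1 = rng.normal(0.0, 0.1, size=(8 * feat_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=hidden)

    def predict(self, x):
        g = x / self.voxel_size
        # Index of the voxel containing x, clipped so all 8 corners stay inside the grid.
        base = np.clip(np.floor(g).astype(int), 0, self.resolution - 1)
        corners, weights = trilinear_weights(g - base)
        idx = base + corners                                  # (8, 3) corner indices
        feats = self.features[idx[:, 0], idx[:, 1], idx[:, 2]]  # (8, F) corner features
        z = (weights[:, None] * feats).reshape(-1)            # weighted, then concatenated
        h = np.maximum(z @ self.w1 + self.b1, 0.0)            # ReLU hidden layer
        return float(h @ self.w2)

def to_world(T, p_sensor):
    """Apply a 4x4 homogeneous transform T to a 3D point given in the sensor frame."""
    return T[:3, :3] @ p_sensor + T[:3, 3]

def occupancy(levels, T, p_sensor, active_levels=None):
    """Accumulate the predictions of the active levels (coarse first) into a probability."""
    x = to_world(T, p_sensor)
    n = len(levels) if active_levels is None else active_levels
    logit = sum(level.predict(x) for level in levels[:n])
    return 1.0 / (1.0 + np.exp(-logit))
```

In this sketch, a coarse level such as `Level(4, 2.0, 4, 8, rng)` and a fine level such as `Level(16, 0.5, 4, 8, rng)` cover the same 8 m cube. Calling `occupancy(levels, T, p, active_levels=1)` queries only the coarse level, which loosely mirrors the coarse-to-fine activation mentioned in the TL;DR. Because the output depends on `T` through `to_world`, an automatic-differentiation framework could backpropagate a loss on this output to the pose, which is the mechanism the caption refers to.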