Vox-Fusion++: Voxel-based Neural Implicit Dense Tracking and Mapping with Multi-maps

Hongjia Zhai, Hai Li, Xingrui Yang, Gan Huang, Yuhang Ming, Hujun Bao, Guofeng Zhang

TL;DR

Vox-Fusion++ tackles robust, real-time dense SLAM by unifying voxel-based neural implicit surfaces with traditional volumetric fusion in a dynamic octree. It leverages sparse voxel embeddings, an on-the-fly SDF decoder $F_{\theta}$, and differentiable rendering to achieve accurate geometry and color, while adopting a multi-map framework with loop closure and hierarchical pose optimization to scale to large scenes. Key contributions include dynamic voxel expansion without scene bounds, a multi-map incremental mapping strategy, appearance-and-geometry loop detection, and inter/intra-map optimization that reduce drift and duplicate geometry, all with favorable time and memory characteristics. The approach enables AR occlusion handling and collaborative mapping across multiple agents, demonstrating strong reconstruction quality and efficiency on benchmarks and large real-world scenes.

Abstract

In this paper, we introduce Vox-Fusion++, a multi-maps-based robust dense tracking and mapping system that seamlessly fuses neural implicit representations with traditional volumetric fusion techniques. Building upon the concept of implicit mapping and positioning systems, our approach extends its applicability to real-world scenarios. Our system employs a voxel-based neural implicit surface representation, enabling efficient encoding and optimization of the scene within each voxel. To handle diverse environments without prior knowledge, we incorporate an octree-based structure for scene division and dynamic expansion. To achieve real-time performance, we propose a high-performance multi-process framework. This ensures the system's suitability for applications with stringent time constraints. Additionally, we adopt the idea of multi-maps to handle large-scale scenes, and leverage loop detection and hierarchical pose optimization strategies to reduce long-term pose drift and remove duplicate geometry. Through comprehensive evaluations, we demonstrate that our method outperforms previous methods in terms of reconstruction quality and accuracy across various scenarios. We also show that our Vox-Fusion++ can be used in augmented reality and collaborative mapping applications. Our source code will be publicly available at https://github.com/zju3dv/Vox-Fusion_Plus_Plus.
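
The dynamic expansion mentioned in the abstract can be illustrated with a minimal sketch. It assumes a flat set of integer voxel keys rather than the octree the paper uses, and the function name `allocate_voxels` and the `voxel_size` parameter are hypothetical, not taken from the released code. The point it shows is that voxels can be allocated wherever observed points fall, so no scene bounding box has to be known in advance.

```python
import numpy as np

def allocate_voxels(points, voxel_size, allocated):
    """Add the voxels touched by `points` to `allocated`; return the new keys.

    `points` is an (N, 3) array-like of 3D points (e.g. back-projected depth),
    `allocated` is a set of integer (i, j, k) voxel keys, updated in place.
    This flat set is a stand-in for an octree (an assumption of this sketch).
    """
    pts = np.asarray(points, dtype=np.float64)
    # Floor (not truncation) keeps negative coordinates in the right voxel,
    # so the map can grow in any direction without predefined bounds.
    keys = np.floor(pts / voxel_size).astype(np.int64)
    new_keys = []
    # np.unique with axis=0 removes duplicate rows (points in the same voxel).
    for key in map(tuple, np.unique(keys, axis=0).tolist()):
        if key not in allocated:
            allocated.add(key)
            new_keys.append(key)
    return new_keys

allocated = set()
first = allocate_voxels(
    [[0.05, 0.05, 0.05], [0.15, 0.05, 0.05], [-0.05, 0.0, 0.0]], 0.2, allocated
)
second = allocate_voxels([[0.1, 0.1, 0.1], [0.5, 0.1, 0.1]], 0.2, allocated)
```

In this toy run, the first call allocates two voxels and the second call adds only the one voxel that was not seen before. A real system would attach a feature embedding to each newly allocated voxel.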

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview of our SLAM system. The system consists of four parts: 1) Tracking Process: takes RGB-D frames as input and optimizes camera poses through differentiable rendering; 2) Volume Renderer: encodes the scene in an MLP and voxel feature embeddings, producing rendered color and Signed Distance Function (SDF) values for each point; 3) Mapping Process: reconstructs the scene geometry via volume rendering and performs incremental mapping with multi-maps for large scenes; 4) Loop Process: performs loop detection and hierarchical pose optimization between different maps to reduce pose drift.
  • Figure 2: Illustration of our loop detection. 1) Appearance check: we first calculate the similarity between the current frame and the keyframes inside the loop map candidate. 2) Geometry check: we then perform an intersection test between the rays from the current frame and the sparse voxels of the loop map.
  • Figure 3: Illustration of hierarchical pose optimization. The optimization process consists of two steps: inter-map pose optimization and intra-map pose optimization. When a loop closure happens, we first perform inter-map optimization to update the pose of each map, $\{T_{i}^m\}$. We then perform global bundle adjustment of the keyframes within a map.
  • Figure 4: Qualitative reconstruction results on the Replica dataset. From left to right, we show the scene reconstructions of different methods (iMAP$^{*}$, NICE-SLAM, our method, and ground truth). Our reconstruction results are clearly much better than those of iMAP$^{*}$. To better show the difference in reconstruction between NICE-SLAM and our method, we use red boxes in the figures to indicate the improvements over NICE-SLAM.
  • Figure 5: Qualitative comparison on the ScanNet dataset from different views. From left to right, we show the scene reconstructions of different methods (iMAP$^{*}$, NICE-SLAM, Co-SLAM, ours, and ScanNet Mesh).
  • ...and 4 more figures
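
The two-stage loop detection described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the appearance check is modeled as cosine similarity between global frame descriptors (the caption only says "similarity"), the geometry check is a standard slab-based ray/box intersection test, and all function names and thresholds (`sim_threshold`, `min_hit_ratio`) are hypothetical.

```python
import numpy as np

def appearance_check(query_desc, keyframe_descs, sim_threshold=0.8):
    """Return (passed, best_index) using cosine similarity (an assumption)."""
    q = query_desc / np.linalg.norm(query_desc)
    k = keyframe_descs / np.linalg.norm(keyframe_descs, axis=1, keepdims=True)
    sims = k @ q  # (K,) similarity to each keyframe of the candidate map
    return bool(sims.max() >= sim_threshold), int(sims.argmax())

def rays_hit_voxels(origins, dirs, voxel_mins, voxel_size):
    """Slab test. Returns a boolean (R,) array: does ray r hit any voxel?"""
    # Replace near-zero direction components to avoid division by zero.
    safe = np.where(np.abs(dirs) < 1e-12, 1e-12, dirs)
    inv = 1.0 / safe
    voxel_maxs = voxel_mins + voxel_size
    # Broadcast rays (R, 1, 3) against voxels (1, V, 3).
    t0 = (voxel_mins[None] - origins[:, None]) * inv[:, None]
    t1 = (voxel_maxs[None] - origins[:, None]) * inv[:, None]
    t_near = np.minimum(t0, t1).max(axis=2)  # latest entry over the 3 slabs
    t_far = np.maximum(t0, t1).min(axis=2)   # earliest exit over the 3 slabs
    hit = (t_near <= t_far) & (t_far >= 0.0)  # box lies in front of the ray
    return hit.any(axis=1)

def geometry_check(origins, dirs, voxel_mins, voxel_size, min_hit_ratio=0.5):
    """Pass if enough rays of the current frame intersect the map's voxels."""
    hits = rays_hit_voxels(origins, dirs, voxel_mins, voxel_size)
    return bool(hits.mean() >= min_hit_ratio)

# Toy data: one unit voxel with its minimum corner at (0, 0, 1).
voxel_mins = np.array([[0.0, 0.0, 1.0]])
origins = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [5.0, 5.0, 0.0]])
dirs = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [0.0, 0.0, 1.0]])
hits = rays_hit_voxels(origins, dirs, voxel_mins, 1.0)

passed, best = appearance_check(
    np.array([1.0, 0.0]), np.array([[0.0, 1.0], [1.0, 0.1]])
)
```

In the toy data, only the first ray points at the voxel; the second points away from it and the third passes beside it. Running the cheap descriptor comparison first and the ray test second keeps the more expensive geometric test limited to candidates that already look similar.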