MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations

Yifan Yan, Ruomin He, Zhenghua Liu

TL;DR

MUTE-SLAM tracks camera positions and incrementally builds a scalable multi-map representation of both small and large indoor environments. It dynamically allocates sub-maps for newly observed local regions, so mapping needs no pre-defined scene boundaries or other prior scene information.

Abstract

We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. Whereas previous methods often require pre-defined scene boundaries, MUTE-SLAM dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we hash-encode scene properties on three orthogonal axis-aligned planes, which significantly reduces hash collisions and the number of trainable parameters. This hybrid approach not only ensures real-time performance but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets shows that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code is available at https://github.com/lumennYan/MUTE_SLAM.
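
To make the tri-plane hash-encoding idea concrete, here is a minimal single-resolution sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class names, the table size, the feature dimension, the use of concatenation to combine the three plane features, and the borrowed Instant-NGP style spatial hash are all choices made for this example. A point in a unit cube is projected onto the xy, xz and yz planes; each plane looks up its four surrounding grid vertices in a hash table and blends them bilinearly. At resolution 64, a plane has 65^2 = 4225 vertices while a dense 3D grid has 65^3 = 274625, so far fewer vertices compete for the same number of table slots.

```python
import random

# Multipliers of the Instant-NGP style spatial hash, restricted to 2D (assumed here).
PRIMES = (1, 2654435761)


def hash2d(ix, iy, table_size):
    """Map an integer 2D grid vertex to a slot in the hash table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % table_size


class HashPlane:
    """One axis-aligned feature plane backed by a hash table (hypothetical helper)."""

    def __init__(self, table_size, feat_dim, resolution, seed):
        rng = random.Random(seed)
        # Small random features stand in for trainable parameters.
        self.table = [[rng.uniform(-1e-4, 1e-4) for _ in range(feat_dim)]
                      for _ in range(table_size)]
        self.table_size = table_size
        self.feat_dim = feat_dim
        self.resolution = resolution

    def query(self, u, v):
        """Bilinearly interpolate features at plane coordinates u, v in [0, 1]."""
        x, y = u * self.resolution, v * self.resolution
        x0 = min(int(x), self.resolution - 1)
        y0 = min(int(y), self.resolution - 1)
        tx, ty = x - x0, y - y0
        out = [0.0] * self.feat_dim
        for dx, wx in ((0, 1.0 - tx), (1, tx)):
            for dy, wy in ((0, 1.0 - ty), (1, ty)):
                feat = self.table[hash2d(x0 + dx, y0 + dy, self.table_size)]
                for k in range(self.feat_dim):
                    out[k] += wx * wy * feat[k]
        return out


class TriPlaneHashEncoder:
    """Three orthogonal hash planes (xy, xz, yz) for one scene property."""

    def __init__(self, table_size=2 ** 14, feat_dim=2, resolution=64):
        self.planes = [HashPlane(table_size, feat_dim, resolution, seed=i)
                       for i in range(3)]

    def encode(self, point):
        """Encode a 3D point with coordinates normalized to [0, 1]."""
        x, y, z = point
        f_xy = self.planes[0].query(x, y)
        f_xz = self.planes[1].query(x, z)
        f_yz = self.planes[2].query(y, z)
        # List concatenation, not element-wise addition (an assumption of this sketch).
        return f_xy + f_xz + f_yz


encoder = TriPlaneHashEncoder()
features = encoder.encode((0.25, 0.5, 0.75))
```

In a full system the resulting feature vector would be passed to a small decoder that predicts TSDF or color; that decoder, multi-resolution levels, and gradient-based training are omitted here.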

Paper Structure (31 sections, 10 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Our MUTE-SLAM system demonstrates rapid and accurate tracking and mapping across indoor environments of varying scales without pre-defined boundaries. We depict the trajectories and meshes of both a small and a large scenario: estimated trajectories are marked in blue, while ground truths are in green. The left image is an around-a-desk scene from the TUM-RGBD dataset [sturm2012benchmark], while the image on the right is a multiple-room scene from the Apartment dataset provided by NICE-SLAM [zhu2022nice].
  • Figure 2: Overview of MUTE-SLAM. Our method consists of four parts. 1) Scene representation: the whole scene is represented by several sub-maps created on the fly. Each sub-map comprises two tri-plane hash encoders, one for TSDF and the other for color. 2) Tracking: this module optimizes the pose for each frame through differentiable rendering. 3) Mapping: the mapping module dynamically allocates new sub-maps using the tracked pose. It jointly optimizes scene and pose parameters, using the current frame along with co-visible keyframes. 4) Bundle adjustment: by sampling keyframes globally, this module further refines all trainable parameters and ensures global consistency.
  • Figure 3: Qualitative reconstruction results on Replica.
  • Figure 4: Qualitative reconstruction results on ScanNet [dai2017scannet]. Our reconstructed mesh achieves better completion and fewer artifacts compared to ESLAM [johari2023eslam]. Additionally, our method produces sharper and more detailed geometry than Co-SLAM [wang2023co].
  • Figure 5: Qualitative comparison of our method with and without tri-plane hash-encoding, using reconstructed meshes from Replica [straub2019replica] scenes. The left-most images illustrate how hash collisions can result in rough surfaces and low-quality textures in flat areas like walls and windows. Our tri-plane approach significantly mitigates these issues, achieving better results even with smaller hash tables. The other two images further show that our design leaves fewer artifacts in unobserved regions.
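
The mapping step described in the Figure 2 caption (allocating sub-maps on the fly and optimizing those the current view touches) can be illustrated with a small sketch. Everything below is hypothetical: the `SubMap` class, the `update_submaps` function, the axis-aligned box shape, the `outside_ratio` threshold and the `padding` margin are assumptions chosen for this example, and the source does not state the authors' actual allocation criterion. The sketch takes 3D points back-projected from the current frame, allocates a new box when too many of them fall outside every existing sub-map, and returns the indices of the sub-maps that contain at least one point.

```python
from dataclasses import dataclass


@dataclass
class SubMap:
    """Axis-aligned box standing in for one sub-map (hypothetical)."""
    lo: tuple
    hi: tuple

    def contains(self, p):
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))


def update_submaps(submaps, points, outside_ratio=0.2, padding=0.5):
    """Allocate a sub-map if needed and return indices of the touched sub-maps.

    `points` are 3D points observed in the current frame. The threshold and
    padding values are illustrative assumptions, not values from the paper.
    """
    outside = [p for p in points if not any(m.contains(p) for m in submaps)]
    if points and len(outside) / len(points) > outside_ratio:
        # Cover the uncovered points with a padded bounding box.
        lo = tuple(min(p[i] for p in outside) - padding for i in range(3))
        hi = tuple(max(p[i] for p in outside) + padding for i in range(3))
        submaps.append(SubMap(lo, hi))
    # Sub-maps containing any observed point would be optimized together.
    return [i for i, m in enumerate(submaps)
            if any(m.contains(p) for p in points)]


submaps = []
first = update_submaps(submaps, [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
second = update_submaps(submaps, [(0.5, 0.5, 0.5), (5.0, 5.0, 5.0), (6.0, 6.0, 6.0)])
third = update_submaps(submaps, [(0.2, 0.2, 0.2)])
```

Here the first frame creates a sub-map, the second frame sees mostly uncovered space and triggers a second one while still touching the first, and the third frame stays inside existing coverage, so nothing new is allocated.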