DF-SLAM: Dictionary Factors Representation for High-Fidelity Neural Implicit Dense Visual SLAM System

Weifeng Wei; Jie Wang; Shuqi Deng; Jie Liu

TL;DR

DF-SLAM targets high-fidelity dense visual SLAM by representing the scene with dictionary factors: basis grids $B_g^l$ and $B_a^l$ for geometry and appearance, together with coefficient grids $C_g$ and $C_a$. Per-point features queried from these grids feed decoders that produce $s(x_i)$ and color. The method adds feature integration rendering, in which per-sample appearance features along a ray are aggregated as $f_a(r)=\sum_i w_i f_a(x_i)$ and the aggregated feature is decoded to a color by a shallow MLP, which speeds up color rendering without sacrificing quality. Geometry and appearance are optimized jointly through a composite loss $\mathcal{L}=\lambda_c\mathcal{L}_c+\lambda_d\mathcal{L}_d+\lambda_{fs}\mathcal{L}_{fs}+\lambda_{sdf}\mathcal{L}_{sdf}$, with tracking and mapping performed over a frame window and with distinct initialization and update rules. Extensive experiments on Replica, ScanNet, and TUM-RGBD show competitive real-time performance, detailed reconstructions, and robust camera localization, and ablations confirm the value of the dictionary-factor design and of feature integration rendering. The code is released to facilitate public benchmarking and reuse.
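The aggregation $f_a(r)=\sum_i w_i f_a(x_i)$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the layer sizes, the ReLU and sigmoid activations, the random weights, and the normalization of $w_i$ to sum to one are all assumptions made for illustration. The point it shows is that the color decoder runs once per ray on the integrated feature instead of once per sample.

```python
import numpy as np

def integrate_ray_features(weights, features):
    """Aggregate per-sample features along one ray: f_a(r) = sum_i w_i * f_a(x_i).

    weights:  (N,)   rendering weights w_i (assumed given and normalized here)
    features: (N, D) appearance features f_a(x_i) at the N samples
    returns:  (D,)   integrated ray feature
    """
    return weights @ features

def shallow_mlp(feature, w1, b1, w2, b2):
    """Hypothetical one-hidden-layer decoder mapping a ray feature to RGB."""
    hidden = np.maximum(feature @ w1 + b1, 0.0)   # ReLU (assumed activation)
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid keeps RGB in [0, 1]

rng = np.random.default_rng(0)
n_samples, feat_dim, hidden_dim = 8, 4, 16        # illustrative sizes only

# Placeholder inputs: random weights and features stand in for real queries.
weights = rng.random(n_samples)
weights /= weights.sum()
features = rng.normal(size=(n_samples, feat_dim))

# Placeholder decoder parameters.
w1 = rng.normal(size=(feat_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, 3))
b2 = np.zeros(3)

ray_feature = integrate_ray_features(weights, features)
rgb = shallow_mlp(ray_feature, w1, b1, w2, b2)    # one decoder call per ray
```

In this sketch the decoder is evaluated once for the ray, whereas decoding every sample before blending would require `n_samples` evaluations; that difference is the source of the speed-up the summary describes.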

Abstract

We introduce DF-SLAM, a high-fidelity neural implicit dense visual Simultaneous Localization and Mapping (SLAM) system. We employ dictionary factors for scene representation, encoding the geometry and appearance information of the scene as a combination of basis and coefficient factors. Compared to neural implicit dense visual SLAM methods that directly encode scene information as features, our method reconstructs scene detail better and uses memory more efficiently. Because our model size is insensitive to the size of the scene map, the method is also better suited to large-scale scenes. Additionally, we employ feature integration rendering to accelerate color rendering while preserving rendering quality, further improving the real-time performance of our neural SLAM method. Extensive experiments on synthetic and real-world datasets demonstrate that our method is competitive with existing state-of-the-art neural implicit SLAM methods in terms of real-time performance, localization accuracy, and scene reconstruction quality. Our source code is available at https://github.com/funcdecl/DF-SLAM.

Paper Structure (23 sections, 12 equations, 8 figures, 8 tables)

Introduction
Related Works
Traditional and Learning-based Dense Visual SLAM
Neural Implicit Representations
Neural Implicit Dense Visual SLAM
Method
Dictionary Factors Representation
Feature Integration Rendering
Tracking and Mapping
Loss Functions
Tracking
Mapping
Experiments
Experimental Setup
Datasets
...and 8 more sections

Figures (8)

Figure 1: Overview. 1) Scene representation: We use two different sets of factor grids to represent the scene geometry and appearance respectively. To simplify the overview, we use the symbol ${*}$ to denote both geometry $g$ and appearance $a$, e.g., $b_{*}(x_i)$ can be either $b_g(x_i)$ or $b_a(x_i)$. For sample points along the ray, we query the basis and coefficient factors for depth and feature integration rendering. 2) Mapping process: The scene representation and camera poses are jointly optimized. 3) Tracking process: Each input camera pose is updated by minimizing the losses.
Figure 2: Comparison of qualitative reconstruction results on the Replica dataset.
Figure 3: Comparison of qualitative reconstruction results on the Replica dataset. We visualize untextured meshes.
Figure 4: Comparison of qualitative reconstruction results on the ScanNet dataset.
Figure 5: Qualitative reconstruction on the ScanNet dataset. We visualize untextured meshes.
...and 3 more figures
