Have We Mastered Scale in Deep Monocular Visual SLAM? The ScaleMaster Dataset and Benchmark

Hyoseok Ju, Bokeon Suh, Giseop Kim

TL;DR

This paper introduces the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions, and systematically analyzes the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency.

Abstract

Recent advances in deep monocular visual Simultaneous Localization and Mapping (SLAM) have achieved impressive accuracy and dense reconstruction capabilities, yet their robustness to scale inconsistency in large-scale indoor environments remains largely unexplored. Existing benchmarks are limited to room-scale or structurally simple settings, leaving critical issues of intra-session scale drift and inter-session scale ambiguity insufficiently addressed. To fill this gap, we introduce the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions. We systematically analyze the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency, providing both quantitative and qualitative evaluations. Crucially, our analysis extends beyond traditional trajectory metrics to include a direct map-to-map quality assessment using metrics like Chamfer distance against high-fidelity 3D ground truth. Our results reveal that while recent deep monocular visual SLAM systems demonstrate strong performance on existing benchmarks, they suffer from severe scale-related failures in realistic, large-scale indoor environments. By releasing the ScaleMaster dataset and baseline results, we aim to establish a foundation for future research toward developing scale-consistent and reliable visual SLAM systems.
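To make the map-to-map assessment mentioned above concrete, the following is a minimal sketch of a symmetric Chamfer distance between two point clouds. It is not the authors' evaluation code. The function names, the toy point sets, and the specific variant (mean nearest-neighbor Euclidean distance, summed over both directions) are assumptions chosen for illustration; other variants use squared distances or a different normalization. The brute-force search is only suitable for tiny inputs, and a real pipeline would likely use a spatial index such as a k-d tree.

```python
import math


def nearest_distance(p, cloud):
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (assumed variant).

    Mean nearest-neighbour distance from a to b, plus the same from b to a.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


# Hypothetical toy "ground truth" map: corners of a unit square on the floor.
gt_map = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]

# The same map with its scale doubled, mimicking a scale-inflated reconstruction.
scaled_map = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in gt_map]

identical = chamfer_distance(gt_map, gt_map)
inflated = chamfer_distance(gt_map, scaled_map)
print(identical, inflated)
```

In this toy case the identical maps give a distance of zero, while the scale-doubled copy gives a clearly non-zero value even though its shape is unchanged. This illustrates why a map-level metric can expose scale errors.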

Paper Structure (18 sections, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Representative example of scale inconsistency. During Sim(3) pose-graph optimization, the SLAM trajectory experiences a sudden scale explosion (highlighted), resulting in an abnormally enlarged reconstruction that diverges from the previously estimated map.
  • Figure 2: A diagram illustrating the components of the scale estimation problem.
  • Figure 3: A visual comparison of trajectory scales between our proposed ScaleMaster dataset and existing standard benchmarks. This distinct contrast visually demonstrates why existing room-scale benchmarks are insufficient for evaluating long-term scale consistency failures.
  • Figure 4: Our overall experimental pipeline, illustrating the process from data acquisition with our custom rig (left), through ground truth map generation and SLAM processing (center), to the final map-to-map error calculation (right).
  • Figure 5: Qualitative comparison of 3D reconstruction of the Library_06 sequence. (a) Ground truth point cloud from LiDAR, color-coded by height. (b) The map reconstructed by MASt3R-SLAM. (c) The alignment of the MASt3R-SLAM map onto the ground truth. (d) Point-to-point distance error visualization, where warmer colors (red) indicate larger geometric inconsistencies between the two maps.
  • ...and 4 more figures