InCrowd-VI: A Realistic Visual-Inertial Dataset for Evaluating SLAM in Indoor Pedestrian-Rich Spaces for Human Navigation

Marziyeh Bamdad, Hans-Peter Hutter, Alireza Darvishy

TL;DR

The paper addresses the lack of realistic data for evaluating SLAM for visually impaired navigation in crowded indoor spaces. It delivers InCrowd-VI, a head-worn visual-inertial dataset with 58 sequences (~5 km, ~1.5 h), ground-truth trajectories (~2 cm accuracy), and semi-dense 3D maps, collected across diverse indoor venues using Meta Aria glasses. An evaluation of state-of-the-art VO/SLAM systems reveals substantial performance gaps in crowded, dynamic conditions: deep-learning approaches offer high pose coverage but fail to run in real time. The dataset serves as a realistic benchmark to drive the development of real-time, robust SLAM tailored to visually impaired navigation, while also highlighting practical limitations and directions for improvement.

Abstract

Simultaneous localization and mapping (SLAM) techniques can support navigation for visually impaired people, but the development of robust SLAM solutions for crowded spaces is limited by the lack of realistic datasets. To address this, we introduce InCrowd-VI, a novel visual-inertial dataset specifically designed for human navigation in indoor pedestrian-rich environments. Recorded using Meta Aria Project glasses, it captures realistic scenarios without environmental control. InCrowd-VI features 58 sequences totaling 5 km of trajectory length and 1.5 hours of recording time, including RGB, stereo images, and IMU measurements. The dataset captures important challenges such as pedestrian occlusions, varying crowd densities, complex layouts, and lighting changes. Ground-truth trajectories, accurate to approximately 2 cm, are provided with the dataset and originate from the Meta Aria Project machine perception SLAM service. In addition, a semi-dense 3D point cloud of the scene is provided for each sequence. The evaluation of state-of-the-art visual odometry (VO) and SLAM algorithms on InCrowd-VI revealed severe performance limitations in these realistic scenarios. Under challenging conditions, localization errors exceeded the required accuracy of 0.5 meters and drift exceeded the 1% threshold, with classical methods showing drift of up to 5-10%. While deep learning-based approaches maintained high pose estimation coverage (>90%), they failed to achieve the real-time processing speeds necessary for navigation at walking pace. These results demonstrate the need for and value of a new dataset to advance SLAM research for visually impaired navigation in complex indoor environments. The dataset and associated tools are publicly available at https://incrowd-vi.cloudlab.zhaw.ch/.
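
The thresholds quoted in the abstract (0.5 m localization accuracy, 1% drift, >90% pose coverage) can be made concrete with a small sketch. The abstract does not define the metrics, so the definitions below are assumptions: absolute trajectory error (ATE) is taken as the RMSE between already-aligned estimated and ground-truth positions, drift as ATE expressed as a percentage of trajectory length, and pose coverage as the share of frames that received a pose estimate. All function names are hypothetical and are not part of the dataset's tooling.

```python
import math

def path_length(points):
    """Total length of a trajectory given as a list of (x, y, z) positions in meters."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def ate_rmse(estimated, ground_truth):
    """RMSE of position error. Assumes both trajectories are already
    time-associated and aligned to the same frame (an assumption, not shown here)."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

def drift_percent(ate_m, trajectory_length_m):
    """Assumed definition: ATE as a percentage of the trajectory length."""
    if trajectory_length_m <= 0:
        raise ValueError("trajectory length must be positive")
    return 100.0 * ate_m / trajectory_length_m

def pose_coverage_percent(num_estimated_poses, num_frames):
    """Assumed definition: share of frames for which a pose was estimated."""
    if num_frames <= 0:
        raise ValueError("number of frames must be positive")
    return 100.0 * num_estimated_poses / num_frames

def meets_requirements(ate_m, drift_pct, max_error_m=0.5, max_drift_pct=1.0):
    """Check the 0.5 m accuracy and 1% drift thresholds named in the abstract."""
    return ate_m <= max_error_m and drift_pct <= max_drift_pct

# Toy example (made-up numbers): a 2 m straight walk with a constant 10 cm lateral offset.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]
ate = ate_rmse(estimated, ground_truth)
drift = drift_percent(ate, path_length(ground_truth))
print(meets_requirements(ate, drift))
```

In this toy case the 10 cm error is well within 0.5 m, yet it amounts to 5% of a 2 m path, so the drift criterion fails. This illustrates why the two thresholds are reported separately.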

Paper Structure

This paper contains 17 sections, 13 equations, 8 figures, 6 tables.

Figures (8)

  • Figure S1: Sample of the manual measurement process for ground-truth validation. Left: Real-world scene with a landmark floor tile highlighted by a pink rectangle. Middle: Full 3D point cloud map of the scene with four adjacent floor tiles marked in blue. Right: Zoomed view of the marked corner of the tiles in the point cloud used for measurement.
  • Figure S2: Correlation between real-world measurements and point-cloud-derived distances in challenging sequences, where state-of-the-art SLAM systems exhibited failure or suboptimal performance. The scatter plot demonstrates a strong linear relationship between real-world and measured distances (in centimeters), with an average error of 2.14 cm, standard deviation of 1.46 cm, and median error of 2.0 cm.
  • Figure S3: Refined 3D reconstruction demonstrating the removal of dynamic pedestrians that initially appeared static relative to the camera on the escalator.
  • Figure S4: Example of image data and corresponding 3D map from a dataset sequence: The top-left image shows the RGB frame, and the top-middle and top-right images represent the left and right images of a stereo pair. The bottom image shows the 3D map of the scene.
  • Figure S5: Distribution of challenges across sequences in the InCrowd-VI dataset, categorized by crowd density levels (High: >10 pedestrians per frame, Medium: 4-10 pedestrians, Low: 1-3 pedestrians, None: no pedestrians). The x-axis represents the different types of challenges, and the y-axis indicates the total number of sequences. Note that the sequences may contain multiple challenges simultaneously.
  • ...and 3 more figures
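
Two of the captions above describe simple computations that can be sketched in code: the crowd density categories of Figure S5 and the error statistics of Figure S2. The sketch below is illustrative only. The function names are hypothetical, the measurement values are made up, and the use of the population standard deviation is an assumption, since the caption does not say which variant was used.

```python
import statistics

def crowd_density_level(pedestrians_per_frame):
    """Map a pedestrian count to the categories given in the Figure S5 caption:
    High: >10, Medium: 4-10, Low: 1-3, None: no pedestrians."""
    if pedestrians_per_frame < 0:
        raise ValueError("pedestrian count cannot be negative")
    if pedestrians_per_frame == 0:
        return "None"
    if pedestrians_per_frame <= 3:
        return "Low"
    if pedestrians_per_frame <= 10:
        return "Medium"
    return "High"

def measurement_error_stats(real_cm, measured_cm):
    """Mean, standard deviation, and median of the absolute differences between
    real-world distances and point-cloud-derived distances (both in centimeters).
    Population standard deviation is an assumption."""
    if not real_cm or len(real_cm) != len(measured_cm):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(r - m) for r, m in zip(real_cm, measured_cm)]
    return {
        "mean": statistics.mean(errors),
        "std": statistics.pstdev(errors),
        "median": statistics.median(errors),
    }

# Made-up distances for illustration; these are not values from the paper.
real = [100.0, 200.0, 300.0]
measured = [102.0, 199.0, 303.0]
print(crowd_density_level(7), measurement_error_stats(real, measured))
```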