Benchmarking Egocentric Visual-Inertial SLAM at City Scale

Anusha Krishnan, Shaohui Liu, Paul-Edouard Sarlin, Oscar Gentilhomme, David Caruso, Maurizio Monge, Richard Newcombe, Jakob Engel, Marc Pollefeys

TL;DR

The paper presents LaMAria, a city-scale egocentric VIO/SLAM benchmark collected with Project Aria glasses to address the gap between academic datasets and real wearable scenarios. It introduces centimeter-accurate ground truth via surveyed control points and a dense pseudo-ground-truth pipeline, along with a controlled experimental set and a main benchmark spanning challenging conditions such as low light, moving platforms, and time-varying calibration. Systematic evaluations show that state-of-the-art academic methods struggle under egocentric conditions, while a commercial SLAM system based on Aria’s API remains more robust, underscoring the need for online calibration, loop closure, and dynamic adaptation. The dataset and evaluation protocol, including the CP@1m and R@5m metrics, enable principled progress for multi-sensor SLAM in unconstrained wearables, and the work publicly releases the data to catalyze future research.

Abstract

Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at https://www.lamaria.ethz.ch.

Paper Structure

This paper contains 51 sections, 6 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: Challenges: LaMAria includes sensor data recorded by head-mounted devices following outdoor and indoor trajectories in diverse conditions and environments that impair the perceived visual information and are thus challenging for existing algorithms.
  • Figure 2: Control points (a) are measured with centimeter accuracy by surveying instruments and are (b) automatically detected in Aria's images using (c) fiducial markers.
  • Figure 3: Types of ground-truth. Left: Any trajectory can be evaluated with high accuracy via sparse alignment against the GT control points (CPs). Right: We also compute GT camera poses, which are denser but less accurate, via a joint multi-sensor optimization.
  • Figure 4: Cross-validation of CPs for sparse alignment. The uncertainties of the triangulations and CP measurements are orders of magnitude smaller than the CP error that we evaluate, in both 2D (left) and 3D (right), validating that our sparse GT is sufficiently accurate for evaluation.
  • Figure 5: Distribution of the whitened visual and IMU residuals.
  • ...and 12 more figures
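
The sparse evaluation described in the Figure 3 caption, where an estimated trajectory is aligned against the surveyed control points, can be illustrated with a minimal sketch. It rests on several assumptions that this excerpt does not confirm: that the SLAM estimate yields a 3D position for each control point (for example by triangulation), that a rigid least-squares (Kabsch) alignment without scale is appropriate for a metric visual-inertial system, and that a score like CP@1m can be read as the fraction of control points whose error after alignment is below 1 m. The function names `rigid_align` and `cp_recall` are hypothetical and are not part of the benchmark's tooling.

```python
import numpy as np


def rigid_align(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i (Kabsch)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance between the centered point sets.
    H = (src - mu_src).T @ (dst - mu_dst)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t


def cp_recall(est_cps, gt_cps, threshold_m=1.0):
    """Fraction of control points within threshold_m of the ground truth after alignment.

    Assumed reading of a CP@1m-style score; the benchmark's exact definition may differ.
    """
    est_cps = np.asarray(est_cps, dtype=float)
    gt_cps = np.asarray(gt_cps, dtype=float)
    R, t = rigid_align(est_cps, gt_cps)
    aligned = est_cps @ R.T + t
    errors = np.linalg.norm(aligned - gt_cps, axis=1)
    return float(np.mean(errors < threshold_m)), errors


if __name__ == "__main__":
    # Synthetic control points: the "estimate" is the ground truth in another frame.
    rng = np.random.default_rng(0)
    gt = rng.uniform(-50.0, 50.0, size=(8, 3))
    theta = 0.7
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    est = (gt - np.array([3.0, -2.0, 1.0])) @ Rz
    recall, errors = cp_recall(est, gt)
    print(f"recall at 1 m: {recall:.2f}, max error: {errors.max():.2e} m")
```

Because a plain least-squares fit is sensitive to outliers, a single badly estimated control point can skew the alignment for all others; a robust variant (for instance RANSAC around `rigid_align`) would be a natural refinement, though whether the benchmark does this is not stated here.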