Spatiotemporal Calibration and Ground Truth Estimation for High-Precision SLAM Benchmarking in Extended Reality

Zichao Shu, Shitao Bei, Lijun Li, Zetao Chen

TL;DR

The paper addresses two limits on the precision of MoCap-based ground truth for XR SLAM benchmarking: spatiotemporal calibration with the device under test and MoCap jitter. It introduces a continuous-time maximum-likelihood estimator that fuses marker-based MoCap, an auxiliary IMU, and the device under test to jointly estimate trajectories and sensor extrinsics, incorporating a variable time offset and screw-congruence weighting. Time-varying states are modeled with $SE(3)$ and $\mathbb{R}^n$ splines, enabling smooth, differentiable fusion of asynchronous high-rate data and efficient batch optimization with the Ceres solver. Extensive simulations and real-world experiments show that the method surpasses existing GT approaches, achieving absolute errors (ARE/ATE) below 0.2°/2 mm and inter-frame relative errors (RRE/RTE) below 0.02°/0.2 mm, and enabling rigorous XR SLAM benchmarking across multiple devices.
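To make the spline idea concrete, the following is a minimal sketch of a uniform cubic B-spline in $\mathbb{R}^n$, evaluated together with its analytic time derivative. It is not taken from the paper's code: the function name `cubic_bspline`, the uniform knot spacing `dt`, and the start time `t0` are assumptions for illustration, and the $SE(3)$ case (which needs manifold operations) is not covered.

```python
def cubic_bspline(ctrl, t, t0, dt):
    """Evaluate a uniform cubic B-spline in R^n at time t.

    ctrl: list of control points (each a sequence of n floats), len >= 4
    t0:   time at which the first spline segment starts (assumed)
    dt:   uniform knot spacing in seconds (assumed)
    Returns (position, velocity) as lists of n floats.
    """
    if len(ctrl) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")

    # Locate the active segment i and the normalized time u in [0, 1).
    s = (t - t0) / dt
    i = max(0, min(int(s), len(ctrl) - 4))
    u = s - i

    # Uniform cubic B-spline basis functions and their derivatives w.r.t. u.
    b = [
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    ]
    db = [
        -((1 - u) ** 2) / 2,
        (3 * u**2 - 4 * u) / 2,
        (-3 * u**2 + 2 * u + 1) / 2,
        u**2 / 2,
    ]

    dim = len(ctrl[0])
    pos = [sum(b[j] * ctrl[i + j][d] for j in range(4)) for d in range(dim)]
    # Chain rule: d/dt = (d/du) / dt.
    vel = [sum(db[j] * ctrl[i + j][d] for j in range(4)) / dt for d in range(dim)]
    return pos, vel


# Control points on a straight line: the spline reproduces linear motion.
ctrl = [(float(k), 2.0 * k, 0.0) for k in range(6)]
pos, vel = cubic_bspline(ctrl, t=0.125, t0=0.0, dt=0.1)
```

Because the curve can be queried at any time, a measurement stamped by a different clock could be compared against the spline at a shifted time such as `t + offset`, which is one plausible way a time offset enters a continuous-time estimator. The analytic velocity is what lets derivative-type measurements (for example from an IMU) be related to the same control points.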

Abstract

Simultaneous localization and mapping (SLAM) plays a fundamental role in extended reality (XR) applications. As the standards for immersion in XR continue to increase, the demands for SLAM benchmarking have become more stringent. Trajectory accuracy is the key metric, and marker-based optical motion capture (MoCap) systems are widely used to generate ground truth (GT) because of their drift-free and relatively accurate measurements. However, the precision of MoCap-based GT is limited by two factors: the spatiotemporal calibration with the device under test (DUT) and the inherent jitter in the MoCap measurements. These limitations hinder accurate SLAM benchmarking, particularly for key metrics like rotation error and inter-frame jitter, which are critical for immersive XR experiences. This paper presents a novel continuous-time maximum likelihood estimator to address these challenges. The proposed method integrates auxiliary inertial measurement unit (IMU) data to compensate for MoCap jitter. Additionally, a variable time synchronization method and a pose residual based on screw congruence constraints are proposed, enabling precise spatiotemporal calibration across multiple sensors and the DUT. Experimental results demonstrate that our approach outperforms existing methods, achieving the precision necessary for comprehensive benchmarking of state-of-the-art SLAM algorithms in XR applications. Furthermore, we thoroughly validate the practicality of our method by benchmarking several leading XR devices and open-source SLAM algorithms. The code is publicly available at https://github.com/ylab-xrpg/xr-hpgt.

Paper Structure

This paper contains 22 sections, 23 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Overview of the proposed estimator, which adopts a coarse-to-fine strategy and uses continuous-time batch estimation to perform spatiotemporal calibration and achieve localization trajectory estimation.
  • Figure 2: Illustration of B-splines in $\mathbb{R}^3$ and on the $SO(3)$ manifold. Black circles denote control points, and the blue curve represents the cubic B-spline. The linear B-spline is shown as a green dashed line, appearing as a straight line in $\mathbb{R}^3$ and a geodesic on $SO(3)$.
  • Figure 3: Illustration of the complementary characteristics of MoCap and IMU data. The MoCap (blue) provides globally consistent measurements but introduces high-frequency noise, leading to significant errors in the derivative domain. Meanwhile, the IMU (green) offers robust derivative measurements but suffers from drift due to temporal integration.
  • Figure 4: Factor graph representation of the batch estimation problem, illustrating the connectivity between state variables and measurement factors from multiple sensors.
  • Figure 5: Accuracy comparison of different methods on the simulation dataset, reporting absolute error (ARE/ATE) and inter-frame relative error (RRE/RTE) of the estimated trajectories.
  • ...and 5 more figures
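
The distinction in Figure 5 between absolute and inter-frame relative error can be illustrated with a simplified, translation-only sketch. The definitions below are assumptions for illustration (RMSE over positions of two already aligned, time-matched trajectories); the paper's exact metric definitions, including the rotational parts (ARE/RRE), may differ, and the function names `ate_rmse` and `rte_rmse` are hypothetical.

```python
import math


def ate_rmse(est, gt):
    """RMSE of per-frame position differences (absolute translation error).

    Assumes est and gt are equally long lists of 3D positions that are
    already expressed in the same frame and matched in time.
    """
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg)) for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rte_rmse(est, gt):
    """RMSE of differences between consecutive-frame displacements.

    A simplified inter-frame relative translation error: it compares how far
    each trajectory moves from frame k to frame k+1.
    """
    sq = []
    for k in range(len(est) - 1):
        d_est = [b - a for a, b in zip(est[k], est[k + 1])]
        d_gt = [b - a for a, b in zip(gt[k], gt[k + 1])]
        sq.append(sum((x - y) ** 2 for x, y in zip(d_est, d_gt)))
    return math.sqrt(sum(sq) / len(sq))


# A constant 2 mm offset shows up in the absolute error only.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(x, y + 0.002, z) for x, y, z in gt]
absolute = ate_rmse(est, gt)
relative = rte_rmse(est, gt)
```

In this toy case the constant offset contributes to the absolute error but cancels in the inter-frame error, whereas frame-to-frame jitter would do the opposite, which is why the two metrics are reported separately.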