Spatiotemporal Calibration and Ground Truth Estimation for High-Precision SLAM Benchmarking in Extended Reality
Zichao Shu, Shitao Bei, Lijun Li, Zetao Chen
TL;DR
The paper tackles the challenge of providing high-precision ground truth (GT) for XR SLAM benchmarking by addressing two limitations of motion capture (MoCap) systems: spatiotemporal calibration and measurement jitter. It introduces a continuous-time maximum-likelihood estimator that fuses marker-based MoCap, an auxiliary IMU, and the device under test to jointly estimate trajectories and sensor extrinsics, incorporating a variable time offset and screw-congruence weighting. Time-varying states are modeled with $SE(3)$ and $\mathbb{R}^n$ splines, enabling smooth, differentiable fusion of asynchronous high-rate data and efficient batch optimization with the Ceres Solver. Extensive simulations and real-world experiments show that the method surpasses existing GT approaches, achieving absolute rotation/translation errors (ARE/ATE) below 0.2°/2 mm and relative rotation/translation errors (RRE/RTE) below 0.02°/0.2 mm, and enabling rigorous XR SLAM benchmarking across multiple devices.
Abstract
Simultaneous localization and mapping (SLAM) plays a fundamental role in extended reality (XR) applications. As the standards for immersion in XR continue to increase, the demands for SLAM benchmarking have become more stringent. Trajectory accuracy is the key metric, and marker-based optical motion capture (MoCap) systems are widely used to generate ground truth (GT) because of their drift-free and relatively accurate measurements. However, the precision of MoCap-based GT is limited by two factors: the spatiotemporal calibration with the device under test (DUT) and the inherent jitter in the MoCap measurements. These limitations hinder accurate SLAM benchmarking, particularly for key metrics like rotation error and inter-frame jitter, which are critical for immersive XR experiences. This paper presents a novel continuous-time maximum likelihood estimator to address these challenges. The proposed method integrates auxiliary inertial measurement unit (IMU) data to compensate for MoCap jitter. Additionally, a variable time synchronization method and a pose residual based on screw congruence constraints are proposed, enabling precise spatiotemporal calibration across multiple sensors and the DUT. Experimental results demonstrate that our approach outperforms existing methods, achieving the precision necessary for comprehensive benchmarking of state-of-the-art SLAM algorithms in XR applications. Furthermore, we thoroughly validate the practicality of our method by benchmarking several leading XR devices and open-source SLAM algorithms. The code is publicly available at https://github.com/ylab-xrpg/xr-hpgt.
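To make the continuous-time idea concrete, the following is a minimal sketch, not the authors' implementation, of how a uniform cubic B-spline in $\mathbb{R}^n$ can be queried at an arbitrary measurement time shifted by a time offset. The function names (`eval_spline`, `position_residual`), the uniform knot spacing, and the choice to represent the time offset as a one-dimensional spline are all assumptions made for illustration. The method described here also uses $SE(3)$ splines and a screw-congruence pose residual, which this sketch does not cover.

```python
import numpy as np

# Basis matrix of a uniform cubic B-spline (rows: powers 1, u, u^2, u^3).
M = np.array([[ 1.0,  4.0,  1.0, 0.0],
              [-3.0,  0.0,  3.0, 0.0],
              [ 3.0, -6.0,  3.0, 0.0],
              [-1.0,  3.0, -3.0, 1.0]]) / 6.0


def eval_spline(ctrl, t0, dt, t):
    """Evaluate a uniform cubic B-spline with control points `ctrl` (K x n).

    `t0` is the start time of the first segment and `dt` the knot spacing.
    """
    s = (t - t0) / dt
    i = int(np.floor(s))          # index of the active segment
    if i < 0 or i > len(ctrl) - 4:
        raise ValueError("query time lies outside the spline support")
    u = s - i                     # normalized time within the segment, in [0, 1)
    weights = np.array([1.0, u, u * u, u ** 3]) @ M
    return weights @ ctrl[i:i + 4]


def position_residual(pos_ctrl, off_ctrl, t0, dt, t_meas, z):
    """Residual between a position measurement `z` stamped `t_meas` and the
    trajectory spline queried at the offset-corrected time.

    Assumption of this sketch: the time offset is itself a 1-D spline.
    """
    offset = eval_spline(off_ctrl, t0, dt, t_meas)[0]
    return eval_spline(pos_ctrl, t0, dt, t_meas + offset) - z


if __name__ == "__main__":
    pos_ctrl = np.arange(8, dtype=float)[:, None] * np.ones(3)  # hypothetical control points
    off_ctrl = np.full((8, 1), 0.02)                            # constant 20 ms offset
    z = eval_spline(pos_ctrl, 0.0, 0.1, 0.25)
    print(position_residual(pos_ctrl, off_ctrl, 0.0, 0.1, 0.23, z))
```

In a batch estimator, residuals of this kind would be stacked over all measurements and minimized jointly over the control points and calibration parameters. Because the spline is differentiable in both its control points and its query time, the time offset can be optimized alongside the trajectory.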
