GS-EVT: Cross-Modal Event Camera Tracking based on Gaussian Splatting

Tao Liu, Runze Yuan, Yi'ang Ju, Xun Xu, Jiaqi Yang, Xiangting Meng, Xavier Lagorce, Laurent Kneip

TL;DR

This paper explores the use of event cameras for motion tracking, which provides inherent robustness under difficult dynamics and illumination. The proposed method operates on top of Gaussian splatting, a state-of-the-art representation that permits highly efficient and realistic novel view synthesis.

Abstract

Reliable self-localization is a foundational skill for many intelligent mobile platforms. This paper explores the use of event cameras for motion tracking, thereby providing a solution with inherent robustness under difficult dynamics and illumination. To circumvent the challenge of event camera-based mapping, the solution is framed in a cross-modal way: it tracks against a map representation that comes directly from frame-based cameras. Specifically, the proposed method operates on top of Gaussian splatting, a state-of-the-art representation that permits highly efficient and realistic novel view synthesis. The key to our approach is a novel pose parametrization that uses a reference pose plus first-order dynamics for local differential image rendering. The rendered differential image is then compared against images of integrated events in a staggered coarse-to-fine optimization scheme. As demonstrated by our results, the realistic view rendering ability of Gaussian splatting leads to stable and accurate tracking across a variety of both publicly available and newly recorded data sequences.
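
The comparison between a rendered differential image and an image of integrated events can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the log-intensity difference, the `eps` offset, and the L2 normalization of both images are assumptions made here for illustration, and the two rendered views are stand-in arrays rather than outputs of a Gaussian splatting rasterizer.

```python
import numpy as np

def accumulate_events(xs, ys, ps, shape, signed=True):
    """Integrate events into an image of shape (H, W).

    xs, ys: integer pixel coordinates; ps: polarities in {-1, +1}.
    With signed=False every event counts as +1 (polarity-free image).
    """
    img = np.zeros(shape, dtype=np.float64)
    weights = ps.astype(np.float64) if signed else np.ones(len(ps))
    # np.add.at is an unbuffered add, so repeated pixels accumulate correctly.
    np.add.at(img, (ys, xs), weights)
    return img

def rendered_change(img_first, img_last, eps=1e-3):
    """Assumed event model: difference of log intensities of two renders."""
    return np.log(img_last + eps) - np.log(img_first + eps)

def normalize(img):
    """Scale to unit L2 norm (assumption: removes the unknown contrast scale)."""
    n = np.linalg.norm(img)
    return img / n if n > 0 else img

def photometric_loss(event_img, change_img):
    """Sum of squared differences between the normalized images."""
    return float(np.sum((normalize(event_img) - normalize(change_img)) ** 2))
```

For a polarity-free comparison, as in a coarse stage, one option under the same assumptions is to accumulate with `signed=False` and pass `np.abs(rendered_change(...))` to `photometric_loss`. Because both images are normalized, the loss does not depend on the overall magnitude of either image.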

Paper Structure (14 sections, 22 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the tracking procedure. The event camera moves freely in 6-DoF within the Gaussian map, and high-fidelity intensity images are obtained by rasterizing these Gaussians. We simulate event generation by taking the difference between two intensity images. The camera pose is optimized by calculating the photometric error between the intensity changes rendered from the Gaussian map and the actual intensity changes captured by the event camera.
  • Figure 2: Block diagram of the GS-EVT pipeline. We build the 3DGS map from a sequence of RGB images. Within this map, the event camera accumulates a keyframe over a short trajectory (green line). The poses of the first and last events (green frustum) within the keyframe are determined in Eq. \ref{eq:T_first_last}. Using these poses, we render two intensity images from the map. The difference between these images generates a rendered intensity change image. The event keyframe is directly accumulated from the event stream. By utilizing a photometric loss function, we initially perform a polarity-free pose-only optimization in the coarse stage, followed by a refinement of the pose incorporating velocity optimization in the fine stage.
  • Figure 3: According to the constant velocity model, we estimate the initial camera pose (orange frustum) based on the previous pose (gray frustum). The blue line represents the approximated keyframe accumulation trajectory, which spans a duration of $\Delta \tau$. $\tau - \frac{\Delta \tau}{2}$ and $\tau + \frac{\Delta \tau}{2}$ denote the first and last event timestamps within that interval. By applying pose optimization ($\mathrm{exp}((\Delta\theta)^{\wedge}), \Delta t$), the initial pose is adjusted to closely align with the ground truth (GT) trajectory. Subsequently, velocity optimization ($\omega, v$) is used to ensure that the keyframe accumulation trajectory is tangent to the GT trajectory (velocity direction) and that its length is corrected so that the thickness of the rendered event edges matches that of the real event edges (velocity magnitude).
  • Figure 4: Visualization of the self-collected dataset. desk sequence with a severely occluded scene type (left); keyboard sequence with a complexly textured scene type (middle); helmet sequence with a highly reflective scene type (right).
  • Figure 5: First row: Severely occluded scene type of sequence desk. Reprojected semi-dense point cloud from EVT (left). Second row: Complexly textured scene type of sequence keyboard. Reprojected semi-dense point cloud from EVT (left). Third row: Highly reflective scene type of sequence helmet. Rendered intensity change from EVPT (left).
  • ...and 1 more figures
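
The captions of Figures 2 and 3 describe a reference pose at time $\tau$ together with a velocity ($\omega, v$) that determines the poses of the first and last events at $\tau - \frac{\Delta \tau}{2}$ and $\tau + \frac{\Delta \tau}{2}$. The equation the caption refers to is not reproduced on this page, so the following is only a sketch under stated assumptions: the twist is expressed in the camera frame, it is applied by right-multiplication with the SE(3) exponential, and the helper names (`se3_exp`, `first_last_poses`, `predict_next_reference`) are hypothetical.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(omega, v):
    """Exponential map from a twist (omega, v) to a 4x4 rigid transform."""
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-8:
        # Second-order Taylor expansion near zero rotation.
        R = np.eye(3) + W + 0.5 * W @ W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * W @ W   # Rodrigues formula
        V = np.eye(3) + B * W + C * W @ W   # couples rotation and translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def first_last_poses(T_ref, omega, v, dtau):
    """Poses at tau - dtau/2 and tau + dtau/2 under constant velocity (assumed form)."""
    half = 0.5 * dtau
    T_first = T_ref @ se3_exp(-half * omega, -half * v)
    T_last = T_ref @ se3_exp(half * omega, half * v)
    return T_first, T_last

def predict_next_reference(T_ref, omega, v, dt):
    """Constant-velocity initial guess for the next reference pose."""
    return T_ref @ se3_exp(dt * omega, dt * v)
```

With this convention the relative motion between the two rendered views is `se3_exp(dtau * omega, dtau * v)`, so scaling the velocity lengthens or shortens the accumulation trajectory, which is consistent with the caption's remark that velocity magnitude controls the thickness of rendered event edges. The two poses would then be passed to a renderer to obtain the two intensity images; that step is omitted here.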