EVIT: Event-based Visual-Inertial Tracking in Semi-Dense Maps Using Windowed Nonlinear Optimization

Runze Yuan, Tao Liu, Zijia Dai, Yi-Fan Zuo, Laurent Kneip

TL;DR

EVIT addresses robust ego-motion tracking for event cameras when a prior semi-dense map is available, targeting challenging dynamics and illumination. The method fuses IMU pre-integration with time-surface-map (TSM) based edge alignment in a sliding-window nonlinear optimization, treating multiple keyframes as a virtual elastic multi-camera rig with inter-frame constraints. Key contributions include adaptive keyframe generation, a time-surface-based event representation, tightly coupled visual-inertial initialization, and a sliding-window back-end that integrates IMU and TSM observations. Experimental results on the VECtor dataset show that EVIT outperforms purely event-based methods, especially in dynamic sequences, while reducing the rate of intermediate event registrations and maintaining real-time capability. The approach generalizes beyond event cameras to regular cameras through similar windowed tracking.

Abstract

Event cameras are visual exteroceptive sensors that react to brightness changes rather than integrating absolute image intensities. Owing to this design, they perform well under challenging dynamics and illumination conditions. While event-based simultaneous tracking and mapping remains a challenging problem, a number of recent works have pointed out the sensor's suitability for prior map-based tracking. By making use of cross-modal registration paradigms, the camera's ego-motion can be tracked across a large spectrum of illumination and dynamics conditions on top of accurate maps that have been created a priori by more traditional sensors. The present paper follows up on a recently introduced event-based geometric semi-dense tracking paradigm and proposes the addition of inertial signals to make the estimation more robust. More specifically, the added signals provide strong cues for pose initialization as well as regularization during windowed, multi-frame tracking. As a result, the proposed framework achieves increased performance under challenging illumination conditions and reduces the rate at which intermediate event representations need to be registered in order to maintain stable tracking across highly dynamic sequences. Our evaluation covers a diverse set of real-world sequences and compares the proposed method against a purely event-based alternative running at different rates.

Paper Structure (15 sections, 10 equations, 5 figures, 3 tables)

This paper contains 15 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of EVIT. Rather than registering only single time-surface maps with respect to a semi-dense point cloud, we propose to do windowed joint registration of multiple adjacent TSMs, which improves registration stability. The added IMU integration terms form the connections between adjacent keyframes, thereby creating a virtual multi-camera rig with elastic connections.
  • Figure 2: Block diagram of the full event-based visual-inertial tracking pipeline. The system takes a stream of events and IMU measurements (colored block) as input and tracks against the reconstructed semi-dense map. The measurement processing module (Section \ref{subsec:measurement processing}) dynamically chooses keyframes and processes the raw data stream into usable single-frame observations. The initialization module (Section \ref{subsec:init}) uses high-frequency event localization results to provide bootstrapping states for subsequent prediction and optimization. The optimization module (Section \ref{subsec:optimization}) tightly fuses IMU pre-integration measurements and TSM representations to achieve accurate state estimation.
  • Figure 3: Factor graph representation of our sliding-window optimization. The graph fuses observations from the event camera, the IMU, and the semi-dense map. Two types of factors are introduced to construct the factor graph: (a) IMU pre-integration factors, and (b) event alignment factors using TSMs. The formulation of these factors is discussed in Sections \ref{subsec:init} and \ref{subsec:optimization}. We fix the first node to maintain consistency with matured nodes that have left the window.
  • Figure 4: Comparison of trajectories generated by various pure and inertial-supported semi-dense event-based tracking solutions, as well as a photometric alternative. The sequence is sofa_fast from the VECtor benchmark \cite{gao2022vector}. The method proposed in this work is called EVIT.
  • Figure 5: Initial projection of semi-dense cloud on TSM using different motion models.
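
The captions above refer repeatedly to time-surface maps (TSMs). As a rough illustration of the general idea, the sketch below builds a time surface with the commonly used exponential-decay definition: each pixel stores a value that decays with the time elapsed since its most recent event. This is a generic textbook form and not necessarily the exact TSM used in EVIT; the function name `time_surface`, the `(x, y, t)` event layout, and the decay constant `tau` are assumptions made for illustration only.

```python
import numpy as np


def time_surface(events, shape, t_ref, tau=0.03):
    """Exponential-decay time surface evaluated at time t_ref.

    events: iterable of (x, y, t) tuples, t in seconds (hypothetical layout).
    shape:  (height, width) of the sensor.
    tau:    decay constant in seconds (illustrative value, not from the paper).
    """
    # Timestamp of the latest event per pixel; -inf marks "no event yet".
    last_t = np.full(shape, -np.inf)
    for x, y, t in events:
        if t <= t_ref:  # ignore events that lie in the future of t_ref
            last_t[y, x] = max(last_t[y, x], t)
    # Recent events map to values near 1, old ones decay towards 0,
    # and pixels that never fired evaluate to exp(-inf) = 0.
    return np.exp(-(t_ref - last_t) / tau)


events = [(1, 0, 0.00), (2, 1, 0.03)]
ts = time_surface(events, shape=(2, 3), t_ref=0.03, tau=0.03)
```

In this toy call, the pixel that fired exactly at `t_ref` takes the value 1, the pixel that fired one decay constant earlier takes `exp(-1)`, and pixels without events stay at 0. Edges that were crossed recently therefore show up as bright ridges, which is what makes such a map usable as a target for aligning projected semi-dense map points.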