Cross-Modal Semi-Dense 6-DoF Tracking of an Event Camera in Challenging Conditions
Yi-Fan Zuo, Wanting Xu, Xia Wang, Yifu Wang, Laurent Kneip
TL;DR
This work addresses robust visual localization under challenging conditions by tracking an event camera cross-modally against semi-dense 3D priors obtained from either a depth sensor or a prior image-based map. It introduces signed time-surface maps (STSMs) to exploit event polarity and a priority-aware occlusion culling strategy based on ANNFs; together these make 6-DoF pose tracking faster and more reliable. The paper presents two pipelines, Canny-DEVO (local, depth-assisted mapping) and Canny-EVT (global map-based tracking using events), and reports superior performance over purely event-based and RGB-D methods under HDR conditions, aggressive motion, and illumination changes. The results support the practical viability of cross-modal event-based tracking for real-time localization in robotics and AR, and the code is released as open source. Overall, the work advances edge-based, semi-dense registration by combining polarity-aware alignment, occlusion handling, and cross-modal priors.
Abstract
Vision-based localization is a cost-effective and thus attractive solution for many intelligent mobile platforms. However, its accuracy and especially its robustness still suffer under low illumination, illumination changes, and aggressive motion. Event-based cameras are bio-inspired visual sensors that perform well in HDR conditions and have high temporal resolution, and thus provide an interesting alternative in such challenging scenarios. While purely event-based solutions do not yet produce satisfying mapping results, the present work demonstrates the feasibility of purely event-based tracking if an alternative sensor is permitted for mapping. The method relies on geometric 3D-2D registration of semi-dense maps and events, and achieves highly reliable and accurate cross-modal tracking results. Practically relevant scenarios are given by depth camera-supported tracking or map-based localization with a semi-dense map prior created by a regular image-based visual SLAM or structure-from-motion system. Conventional edge-based 3D-2D alignment is extended by a novel polarity-aware registration that makes use of signed time-surface maps (STSMs) obtained from event streams. We furthermore introduce a novel culling strategy for occluded points. Both modifications increase the speed of the tracker and its robustness against occlusions and large view-point variations. The approach is validated on many real datasets covering the above-mentioned challenging conditions, and compared against similar solutions realised with regular cameras.
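To make the notion of a signed time surface concrete, the following is a minimal sketch of one common construction, not the paper's exact definition. It assumes events are given as (x, y, t, polarity) tuples, and that each pixel stores an exponentially decayed value of its most recent event, signed by that event's polarity. The function name `signed_time_surface`, the decay constant `tau`, and the reference time `t_ref` are illustrative assumptions.

```python
import math


def signed_time_surface(events, width, height, t_ref, tau=0.03):
    """Build a signed time surface from events (x, y, t, polarity).

    Assumption (illustrative, not taken from the paper): each pixel stores
    sign * exp(-(t_ref - t_last) / tau), where t_last is the timestamp of the
    most recent event at that pixel at or before t_ref, and sign is that
    event's polarity (+1 or -1). Pixels without any event stay at 0.
    """
    last_t = [[None] * width for _ in range(height)]
    last_p = [[0] * width for _ in range(height)]

    for x, y, t, p in events:
        if t > t_ref:
            continue  # ignore events newer than the reference time
        if last_t[y][x] is None or t >= last_t[y][x]:
            last_t[y][x] = t
            last_p[y][x] = 1 if p > 0 else -1

    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if last_t[y][x] is not None:
                age = t_ref - last_t[y][x]
                surface[y][x] = last_p[y][x] * math.exp(-age / tau)
    return surface


# Usage: three events on a 2x1 sensor; the later event at pixel (0, 0)
# overrides the earlier one and flips the sign of that pixel.
events = [(0, 0, 0.00, 1), (1, 0, 0.03, -1), (0, 0, 0.06, -1)]
stsm = signed_time_surface(events, width=2, height=1, t_ref=0.06, tau=0.03)
```

In such a map, recently triggered pixels have magnitudes close to 1 and older ones decay towards 0, while the sign separates positive from negative brightness changes. A polarity-aware registration can use that sign to restrict which edge points are matched against which events.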
