DH-PTAM: A Deep Hybrid Stereo Events-Frames Parallel Tracking And Mapping System

Abanob Soliman; Fabien Bonardi; Désiré Sidibé; Samia Bouchafa

TL;DR

DH-PTAM addresses robust SLAM in challenging high-dynamic-range (HDR) and dynamic environments by fusing stereo frames and stereo event streams in a unified reference frame. It introduces spatio-temporal synchronization, a depth-aware spatial hybridization, and an end-to-end optimization backbone with learning-based descriptors and a lightweight loop-closure mechanism. Experiments on the VECtor and TUM-VIE benchmarks show improved accuracy and robustness over RGB-only and some event-based baselines, particularly in HDR scenarios, with GPU-enabled front-ends offering further gains. The work provides a practical, scalable DH-PTAM implementation and outlines future directions: online optimization of the synchronization and adaptive temporal windowing to improve real-time performance.
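The depth-aware spatial hybridization relies on relating pixels of the event camera to pixels of the frame camera in each stereo stack (see Figure 5). The sketch below shows a standard pinhole reprojection under stated assumptions: the intrinsic matrices `K_d` and `K_c`, a known metric depth for the event pixel, and a 4x4 transform `T_cd` assumed to map event-camera coordinates into frame-camera coordinates are all illustrative choices, and `project_event_pixel_to_camera` is a hypothetical helper, not part of the DH-PTAM API.

```python
import numpy as np


def project_event_pixel_to_camera(p_d, depth, K_d, K_c, T_cd):
    """Map a pixel from the event camera (d) into the frame camera (c).

    Hypothetical helper for illustration. Assumes pinhole intrinsics K_d, K_c,
    a metric depth for the event pixel, and a 4x4 rigid transform T_cd that
    maps points expressed in the event-camera frame into the frame-camera frame.
    """
    u, v = p_d
    # Back-project the event pixel to a 3D point in the event-camera frame.
    # K_d^{-1} [u, v, 1] has z = 1, so scaling by depth gives metric z = depth.
    ray = np.linalg.solve(K_d, np.array([u, v, 1.0]))
    X_d = depth * ray
    # Express the 3D point in the frame-camera coordinate system.
    X_c = T_cd[:3, :3] @ X_d + T_cd[:3, 3]
    # Re-project with the frame camera's intrinsics (perspective division).
    x = K_c @ X_c
    return x[:2] / x[2]
```

For example, with `K_d = K_c = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]`, a 0.1 m offset along x in `T_cd`, and a depth of 2 m, the principal point (320, 240) maps to (345, 240); with an identity transform, every pixel maps to itself.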

Abstract

This paper presents a robust visual parallel tracking and mapping (PTAM) system that excels in challenging environments. The proposed method combines the strengths of heterogeneous multi-modal visual sensors, including stereo event-based and frame-based sensors, in a unified reference frame through a novel spatio-temporal synchronization of stereo visual frames and stereo event streams. To further enhance robustness, we employ deep-learning-based feature extraction and description for state estimation. We also introduce an end-to-end parallel tracking and mapping optimization layer, complemented by a simple loop-closure algorithm for efficient SLAM behavior. Comprehensive experiments on small-scale and large-scale real-world sequences from the VECtor and TUM-VIE benchmarks show that the proposed method (DH-PTAM) achieves superior robustness and accuracy in adverse conditions, especially in large-scale HDR scenarios. The research-oriented Python API of our implementation is publicly available on GitHub for further research and development: https://github.com/AbanobSoliman/DH-PTAM.
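The temporal side of the synchronization assigns a window of events to each fused frame (see Figures 3 and 4). The exact formula for the fusion timestamp $t_f$ is not given here, so the sketch below assumes, purely for illustration, that $t_f$ lies at mid-exposure of the global-shutter frame and that events are gathered from the window $(t_f - \Delta{t}, t_f]$. The optional `max_events` cap mimics the alternative of controlling the event count instead of the window width. `select_events_for_frame` and its parameters are hypothetical names, and all timestamps are assumed to share one unit (e.g. microseconds).

```python
import numpy as np


def select_events_for_frame(event_ts, t_cam, t_exp, dt, max_events=None):
    """Return (t_f, indices) of the events assigned to one fused frame.

    Hypothetical scheme for illustration only:
      * t_f is assumed to be the middle of the exposure: t_cam + t_exp / 2;
      * events are taken from the half-open window (t_f - dt, t_f];
      * if max_events is given, only the most recent events are kept.
    Assumes event_ts is sorted in ascending order.
    """
    t_f = t_cam + 0.5 * t_exp
    # Binary search for the window boundaries in the sorted timestamps.
    lo = np.searchsorted(event_ts, t_f - dt, side="right")
    hi = np.searchsorted(event_ts, t_f, side="right")
    idx = np.arange(lo, hi)
    # Event-count control: keep only the newest max_events events.
    if max_events is not None and idx.size > max_events:
        idx = idx[-max_events:]
    return t_f, idx
```

With timestamps `[0, 10000, 20000, 30000, 40000, 50000]` (microseconds), `t_cam = 20000`, `t_exp = 20000`, and `dt = 20000`, the window is (10000, 30000] and selects indices 2 and 3; passing `max_events=1` keeps only index 3. Using binary search keeps the per-frame cost logarithmic in the stream length.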

Paper Structure (12 sections, 9 equations, 9 figures, 4 tables)

This paper contains 12 sections, 9 equations, 9 figures, 4 tables.

Introduction
Related Work
Methodology
System Overview
Temporal Synchronization Approach
Spatial Hybridization Approach
Optimization-based State Estimation
Evaluation
VECtor large-scale experiments
TUM-VIE small-scale experiments
Ablation experiments
Conclusion

Figures (9)

Figure 1: Experiments on the school-scooter and corner-slow sequences from the VECtor dataset, showing the estimated trajectory with the constructed scene map (green dotted rectangle). The red dotted rectangle highlights an HDR use case in which DH-PTAM estimates the trajectory continuously using either of the two fusion modes (Dynamic Vision Sensor (DVS)-biased or Active Pixel Sensor (APS)-biased). APS denotes the standard camera's global-shutter frames.
Figure 2: Block diagram of the proposed hybrid event-aided stereo visual odometry approach (DH-PTAM). $f$ denotes the fusion function defined in the fusion equation. $\Delta{t}^k$ is the accumulation time of the event volume $\mathcal{V}_0(x,y,t)$ defined in the event-volume equation. $\text{E3CT}$ denotes the Event 3-Channel Tensor [ibiscape], an image-like event representation.
Figure 3: Temporal synchronization scheme. $t_{exp}$ is the global-shutter camera exposure time, $\Delta{t}$ is the accumulation window of the event representation (E3CT) volume, and $t_f$ is the computed timestamp of the fusion frame. $t_{\text{DVS},\text{CAM}}$ denote the timestamps of the DVS events and the RGB camera frames, respectively.
Figure 4: Ablation study comparing a reduced temporal window width against controlling the number of events in the designed window. All event frames are E3CTs post-processed by median filtering followed by binary thresholding.
Figure 5: Geometry of the stereo hybrid event-RGB camera stack. $\mathcal{T}_{cd}$ denotes the rigid-body transformations, and $P^h_{d\in c}$ and $P^h_{d}$ denote pixel locations.
...and 4 more figures
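Figures 2 and 4 describe turning events into an image-like 3-channel tensor and post-processing it with median filtering followed by a binary threshold. The sketch below illustrates that pipeline with a hypothetical stand-in representation, not the paper's exact E3CT definition: channel 0 counts positive events, channel 1 counts negative events, and channel 2 stores each pixel's latest timestamp normalized to [0, 1]. The helper names, the 3x3 kernel, and the 0.5 threshold are assumptions; the event arrays are assumed non-empty with integer pixel coordinates.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def accumulate_three_channel(xs, ys, ts, ps, height, width):
    """Hypothetical 3-channel event image (a stand-in, not the paper's E3CT).

    ch0 = positive-event count, ch1 = negative-event count,
    ch2 = latest event timestamp per pixel, normalized to [0, 1].
    """
    img = np.zeros((3, height, width), dtype=np.float32)
    pos = ps > 0
    # Unbuffered in-place accumulation handles repeated pixel indices correctly.
    np.add.at(img[0], (ys[pos], xs[pos]), 1.0)
    np.add.at(img[1], (ys[~pos], xs[~pos]), 1.0)
    span = max(ts.max() - ts.min(), 1e-9)
    t_norm = ((ts - ts.min()) / span).astype(np.float32)
    np.maximum.at(img[2], (ys, xs), t_norm)
    return img


def median_then_threshold(channel, k=3, thresh=0.5):
    """k x k median filter (edge-padded) followed by a binary threshold."""
    pad = k // 2
    padded = np.pad(channel, pad, mode="edge")
    # (H, W, k, k) view of every neighborhood, without copying data.
    windows = sliding_window_view(padded, (k, k))
    filtered = np.median(windows, axis=(-2, -1))
    return (filtered > thresh).astype(np.uint8)
```

The median filter suppresses isolated noisy events (a single active pixel surrounded by empty ones is removed), while the binary threshold yields a clean mask of spatially coherent activity; a solid 3x3 patch of events, for instance, survives as a plus-shaped 5-pixel mask.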
