EV-MGDispNet: Motion-Guided Event-Based Stereo Disparity Estimation Network with Left-Right Consistency

Junjie Jiang, Hao Zhuang, Xinjie Huang, Delei Kong, Zheng Fang

TL;DR

This work tackles stereo disparity estimation with event cameras by exploiting temporal information and addressing pixel-shift artifacts in feature maps. It introduces EV-MGDispNet, which combines an edge-aware aggregation (EAA) of event streams with motion-guided attention (MGA) to produce edge-rich representations and high-quality cost volumes, reinforced by a left-right consistency census loss. The method achieves state-of-the-art MAE and RMSE on the DSEC dataset and provides qualitative improvements in edge fidelity and planar regions. The approach demonstrates the practical potential of motion-aware, event-based stereo pipelines for robust depth perception in dynamic driving scenes.
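The summary above does not spell out how the left-right consistency census loss is computed. The sketch below is only a generic illustration of the idea, not the authors' formulation: it census-transforms the left representation and the right representation warped by the disparity, then averages the Hamming distance between the two. The function names, the 3x3 window, and the nearest-neighbor integer warp are assumptions made for illustration; a trainable loss would need a differentiable warp.

```python
import numpy as np

def census_transform(img, radius=1):
    """Return an (H, W, K) boolean array; bit k is True where neighbor k is smaller than the center."""
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the center pixel itself
            neighbor = padded[radius + dy: radius + dy + h,
                              radius + dx: radius + dx + w]
            bits.append(neighbor < img)
    return np.stack(bits, axis=-1)

def warp_right_to_left(right, disparity):
    """Sample right[y, x - d] with nearest-neighbor rounding (assumed, non-differentiable warp)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)
    valid = (xs >= 0) & (xs < w)          # pixels whose match falls inside the image
    xs = np.clip(xs, 0, w - 1)
    ys = np.broadcast_to(np.arange(h)[:, None], (h, w))
    return right[ys, xs], valid

def census_lr_loss(left, right, disparity, radius=1):
    """Mean fraction of differing census bits between left and warped right."""
    warped, valid = warp_right_to_left(right, disparity)
    differing = census_transform(left, radius) != census_transform(warped, radius)
    hamming = differing.mean(axis=-1)     # per-pixel normalized Hamming distance
    return float(hamming[valid].mean()) if valid.any() else 0.0

# Toy check: the right view is the left view shifted by 2 pixels.
rng = np.random.default_rng(0)
left = rng.random((8, 16))
right = np.roll(left, -2, axis=1)
loss_true = census_lr_loss(left, right, np.full((8, 16), 2.0))
loss_wrong = census_lr_loss(left, right, np.zeros((8, 16)))
```

In this toy setup the loss is small for the correct disparity and clearly larger for a wrong one. The census transform compares local orderings rather than raw intensities, which is why it is commonly used when the two views differ in brightness or event density.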

Abstract

Event cameras have the potential to revolutionize the field of robot vision, particularly in areas like stereo disparity estimation, owing to their high temporal resolution and high dynamic range. Many studies use deep learning for event camera stereo disparity estimation. However, these methods fail to fully exploit the temporal information in the event stream to acquire clear event representations. Additionally, there is room for further reduction in pixel shifts in the feature maps before constructing the cost volume. In this paper, we propose EV-MGDispNet, a novel event-based stereo disparity estimation method. Firstly, we propose an edge-aware aggregation (EAA) module, which fuses event frames and motion confidence maps to generate a novel clear event representation. Then, we propose a motion-guided attention (MGA) module, where motion confidence maps utilize deformable transformer encoders to enhance the feature map with more accurate edges. Finally, we also add a census left-right consistency loss function to enhance the left-right consistency of stereo event representation. Through conducting experiments within challenging real-world driving scenarios, we validate that our method outperforms currently known state-of-the-art methods in terms of mean absolute error (MAE) and root mean square error (RMSE) metrics.
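For reference, the two metrics named in the abstract can be computed as follows. This is a minimal sketch: the function name and the convention that a ground-truth disparity of 0 marks an unlabeled pixel are assumptions, and the benchmark's own evaluation code should be preferred for reported numbers.

```python
import numpy as np

def disparity_errors(pred, gt, valid=None):
    """Return (MAE, RMSE) in pixels over the valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # Assumption: a ground-truth disparity of 0 means "no label here".
        valid = gt > 0
    err = pred[valid] - gt[valid]
    if err.size == 0:
        raise ValueError("no valid ground-truth pixels")
    mae = float(np.mean(np.abs(err)))          # mean absolute error
    rmse = float(np.sqrt(np.mean(err ** 2)))   # root mean square error
    return mae, rmse

pred = np.array([[10.0, 12.0], [8.0, 5.0]])
gt = np.array([[11.0, 12.0], [0.0, 8.0]])      # one pixel has no label
mae, rmse = disparity_errors(pred, gt)
```

RMSE squares each error before averaging, so it penalizes large outliers (for example, mismatches at object boundaries) more heavily than MAE does.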

Paper Structure (24 sections, 27 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Stereo disparity estimation results from event cameras at a daytime tunnel exit and on a nighttime urban road (taken from the interlaken_00_b and zurich_city_12_a test sequences of the DSEC dataset). Our method performs well in both scenarios. (The event frames in (b) are aggregated edge-modulated event frames.)
  • Figure 2: Overview of the proposed EV-MGDispNet pipeline. First, the MES is fed to the EAA module, which fuses temporal information through motion confidence maps to generate a clear aggregated edge-modulated event frame. Next, the MGA module extracts features from the motion confidence maps and the aggregated edge-modulated event frame and integrates them, yielding an edge-enhanced feature map. The cost volume is then computed from the left and right edge-enhanced feature maps and aggregated using the ISA and CSA modules. Finally, the refinement module produces the stereo disparity estimate. The network is trained with the smooth $L_{1}$ loss and the left-right consistency census loss.
  • Figure 3: Structure of the SPADE layer. It uses motion confidence maps to obtain the tensors $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$, which modulate the distribution of the features $\boldsymbol{D}_{l-1\uparrow}$ from the decoder.
  • Figure 4: Qualitative comparison of our best model with SOTA models on sequences from the DSEC dataset. The scenarios cover three locations: Interlaken, Thun and Zurich City. (a) Images (only for visualization), (b) Aggregated edge-modulated event frames, (c) Motion confidence maps (brighter areas have higher confidence), (d) Disparity maps of Concentration Net (Nam et al., 2022), (e) Disparity maps of EV-MGDispNet (ours).
  • Figure 5: Qualitative module ablation. Sequentially adding the MGA module, the EAA module and the left-right consistency census loss reduces disparity estimation errors around scene contours and intricate structures. (a) Images (only for visualization), (b) w/o MGA module, EAA module, census loss, (c) w/o EAA module, census loss, (d) w/o census loss, (e) EV-MGDispNet (ours).
  • ...and 1 more figures
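
The modulation described in the Figure 3 caption can be sketched as follows, assuming the usual SPADE form in which normalized features are scaled by $\boldsymbol{\gamma}$ and shifted by $\boldsymbol{\beta}$. The function name, the per-channel normalization, and the per-channel scalar weights `w_gamma` and `w_beta` (stand-ins for the learned convolutions that would produce $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def spade_modulate(features, confidence, w_gamma, w_beta, eps=1e-5):
    """Modulate decoder features with a motion confidence map.

    features:   (C, H, W) decoder features
    confidence: (H, W) motion confidence map
    w_gamma, w_beta: (C,) hypothetical weights standing in for learned convolutions
    """
    # Parameter-free normalization per channel over the spatial dimensions.
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Spatially varying scale and shift derived from the confidence map.
    gamma = w_gamma[:, None, None] * confidence[None, :, :]
    beta = w_beta[:, None, None] * confidence[None, :, :]
    return gamma * normalized + beta

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 5))
conf = np.ones((4, 5))
out = spade_modulate(feats, conf, np.ones(3), np.array([1.0, 2.0, 3.0]))
```

Because $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ vary per pixel, regions with high motion confidence can be emphasized differently from static regions, which a single global scale and shift could not do.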