Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency
Junyu Zhu, Lina Liu, Bofeng Jiang, Feng Wen, Hongbo Zhang, Wanlong Li, Yong Liu
TL;DR
This work tackles dense monocular depth estimation with event cameras, where dense depth supervision is scarce, by using pixel-aligned intensity frames as a cross-modal self-supervision signal. The EMoDepth framework trains a Depth-Net and a Pose-Net with a cross-modal consistency loss; at inference time, depth is predicted from events alone, which enables high-frequency depth prediction. A multi-scale skip-connection architecture further improves feature fusion for depth estimation from sparse event data. Experiments on MVSEC and DSEC show that the method outperforms existing supervised event-based and unsupervised frame-based methods, with real-time inference and robust performance under challenging lighting and motion conditions.
Abstract
An event camera is a novel vision sensor that captures per-pixel brightness changes and outputs a stream of asynchronous "events". It has advantages over conventional cameras in scenes with high-speed motion and challenging lighting conditions because of its high temporal resolution, high dynamic range, low bandwidth, low power consumption, and absence of motion blur. Several supervised methods for monocular depth estimation from events have therefore been proposed to handle scenes that are difficult for conventional cameras. However, depth annotation is costly and time-consuming. In this paper, to lower the annotation cost, we propose a self-supervised event-based monocular depth estimation framework named EMoDepth. EMoDepth constrains the training process using cross-modal consistency from intensity frames that are aligned with the events in pixel coordinates. At inference time, only events are used for monocular depth prediction. Additionally, we design a multi-scale skip-connection architecture that effectively fuses features for depth estimation while maintaining high inference speed. Experiments on the MVSEC and DSEC datasets demonstrate that our contributions are effective and that the accuracy of EMoDepth exceeds that of existing supervised event-based and unsupervised frame-based methods.
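To make the idea of cross-modal consistency concrete, the following is a minimal NumPy sketch of a photometric reprojection loss of the kind commonly used in self-supervised depth estimation. It is an illustration under stated assumptions, not the loss used in EMoDepth: here `depth` stands in for a depth map predicted from events, `T` for a relative camera pose, and `target` and `source` for two intensity frames aligned with the events. All function names, the nearest-neighbour sampling, and the plain L1 penalty are hypothetical choices made for brevity.

```python
import numpy as np

def reproject(depth, K, T):
    """Project every target pixel into the source view.

    depth: (H, W) depth map for the target view (assumed to come from events).
    K:     (3, 3) camera intrinsics.
    T:     (4, 4) relative pose mapping target-camera points to the source camera.
    Returns float pixel coordinates (u, v), each of shape (H, W).
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # 3 x N homogeneous pixels
    cam = (np.linalg.inv(K) @ pix) * depth.ravel()           # back-project to 3D points
    cam_h = np.vstack([cam, np.ones(H * W)])                 # 4 x N homogeneous points
    src = K @ (T @ cam_h)[:3]                                # project into source view
    z = np.clip(src[2], 1e-6, None)                          # avoid division by zero
    return (src[0] / z).reshape(H, W), (src[1] / z).reshape(H, W)

def warp_nearest(src_img, u, v):
    """Sample src_img at (u, v) with nearest-neighbour lookup (a simplification)."""
    H, W = src_img.shape
    ui = np.rint(u).astype(int)
    vi = np.rint(v).astype(int)
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    out = np.zeros_like(src_img)
    out[valid] = src_img[vi[valid], ui[valid]]
    return out, valid

def photometric_l1(target, source, depth, K, T):
    """Mean absolute difference between target and the warped source frame."""
    u, v = reproject(depth, K, T)
    warped, valid = warp_nearest(source, u, v)
    if not valid.any():
        return 0.0
    return float(np.abs(warped[valid] - target[valid]).mean())

# Toy check: a sideways camera translation that shifts the image by one pixel.
rng = np.random.default_rng(0)
source = rng.random((4, 6))
target = np.roll(source, -1, axis=1)       # target[:, u] == source[:, u + 1]
K = np.array([[10.0, 0.0, 3.0],
              [0.0, 10.0, 2.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1                              # fx * tx / depth = 1 pixel
depth = np.ones((4, 6))
loss_correct = photometric_l1(target, source, depth, K, T)
loss_wrong = photometric_l1(target, source, 5.0 * depth, K, T)
```

In this toy setup the loss vanishes when depth and pose are consistent with the two frames and is generally non-zero otherwise, which is the signal a self-supervised framework can use in place of depth labels. A practical training loss would typically use differentiable bilinear sampling and additional terms, none of which are shown here.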
