Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency

Junyu Zhu, Lina Liu, Bofeng Jiang, Feng Wen, Hongbo Zhang, Wanlong Li, Yong Liu

TL;DR

This work tackles dense monocular depth estimation with event cameras, addressing the lack of dense depth supervision through cross-modal self-supervision from aligned intensity frames. The EMoDepth framework trains a Depth-Net and a Pose-Net with a cross-modal consistency loss, enabling high-frequency depth prediction from events alone at inference time. A multi-scale skip-connection architecture further improves feature fusion for depth estimation from sparse event data. Experiments on MVSEC and DSEC show that the method outperforms existing supervised event-based and unsupervised frame-based methods, with real-time inference and robust performance under challenging lighting and motion conditions.

Abstract

An event camera is a novel vision sensor that captures per-pixel brightness changes and outputs a stream of asynchronous "events". It has advantages over conventional cameras in scenes with high-speed motion and challenging lighting conditions because of its high temporal resolution, high dynamic range, low bandwidth, low power consumption, and lack of motion blur. Therefore, several supervised methods for monocular depth estimation from events have been proposed to address scenes that are difficult for conventional cameras. However, depth annotation is costly and time-consuming. In this paper, to lower the annotation cost, we propose a self-supervised event-based monocular depth estimation framework named EMoDepth. EMoDepth constrains the training process using cross-modal consistency from intensity frames that are aligned with the events in pixel coordinates. At inference, only events are used for monocular depth prediction. Additionally, we design a multi-scale skip-connection architecture to effectively fuse features for depth estimation while maintaining high inference speed. Experiments on the MVSEC and DSEC datasets demonstrate that our contributions are effective and that the accuracy can outperform existing supervised event-based and unsupervised frame-based methods.
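The cross-modal consistency idea can be illustrated with a minimal NumPy sketch: a depth map (predicted from events in EMoDepth) and a relative pose are used to warp the adjacent intensity frame into the target view, and the warped frame is compared with the target intensity frame. Everything specific here is an assumption made for illustration, not the paper's implementation: the function names `reproject` and `cross_modal_photometric_loss`, the pose convention (target to adjacent), nearest-neighbour sampling, and the plain L1 error are all simplifications.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Project every target pixel into the adjacent view.

    depth: (H, W) depth for the target view (event-based in the paper's setting).
    K: (3, 3) camera intrinsics. R: (3, 3) and t: (3,) map target-camera
    points to the adjacent camera (assumed convention).
    Returns an (H, W, 2) array of (u, v) pixel coordinates in the adjacent frame.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-project pixels to unit-depth rays
    pts = rays * depth[..., None]            # 3D points in the target camera
    pts_adj = pts @ R.T + t                  # same points in the adjacent camera
    proj = pts_adj @ K.T                     # project with the intrinsics
    z = np.clip(proj[..., 2:3], 1e-6, None)  # guard against division by zero
    return proj[..., :2] / z

def cross_modal_photometric_loss(depth, frame_tgt, frame_adj, K, R, t):
    """Mean L1 error between the target intensity frame and the adjacent
    intensity frame warped with the given depth and pose (simplified)."""
    H, W = depth.shape
    uv = reproject(depth, K, R, t)
    u = np.rint(uv[..., 0]).astype(int)      # nearest-neighbour sampling
    v = np.rint(uv[..., 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not valid.any():
        return 0.0
    warped = frame_adj[v[valid], u[valid]]
    return float(np.abs(warped - frame_tgt[valid]).mean())
```

With an identity pose and identical frames the loss is zero; an incorrect depth or pose warps pixels to the wrong locations and raises the error, which is the signal that self-supervised training of this kind relies on. A trainable version would need differentiable (e.g. bilinear) sampling rather than the rounding used here.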

Paper Structure (21 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Method overview. During training (a), Pose-Net and Depth-Net are trained with aligned events and intensity frames. During testing (b), Depth-Net estimates a monocular depth map from events only.
  • Figure 2: Framework illustration. Our Depth-Net uses ResNet-18 as the encoder and several decoder nodes with a multi-scale skip-connection architecture as the decoder. Multi-scale features $f^{e}_{i}$ are encoded by ResNet-18 from the event voxel grid $E_{k}$. These features are fused by decoder nodes $x^{d}_{i}$, which include our proposed multi-scale skip-connection architecture, and the outputs of the decoder nodes are then converted to disparity maps by $DispConv$ blocks. At the same time, the corresponding intensity frame and an adjacent intensity frame are fed to a Pose-Net to predict the relative pose $[R|t]$. Finally, the cross-modal consistency loss is computed using the multi-scale event-based depth maps, the relative pose, and the intensity frames. After training, Depth-Net can estimate high-frequency monocular depth from events only.
  • Figure 3: Visualization of adjacent event spatiotemporal voxels. The first row shows grayscale intensity frames from MVSEC, and the second row shows the aligned events. We sum the voxel along the channel axis, then use blue and red pixels to represent negative and positive results, respectively. (a) and (b) are from sample 125 and sample 126 of sequence $outdoor\_day1$, respectively. As shown in the figure, corresponding pixels can have very different events, so the photoconsistency of event spatiotemporal voxels is weak.
  • Figure 4: Qualitative results on the MVSEC dataset. The qualitative result of Zhu et al. [Zhu_2019_CVPR] is omitted because their code is not publicly available. Our EMoDepth produces more reasonable predictions in distant areas, e.g., the trees and sky in the upper part of the image.
  • Figure 5: Qualitative results on the DSEC dataset. $\mathbb{I}$ means using intensity frames as input and $\mathbb{E}$ means using events as input.
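The visualization described in the Figure 3 caption (summing the event voxel along the channel axis, then colouring negative results blue and positive results red) can be sketched as follows. The function name `voxel_to_polarity_image`, the `(C, H, W)` layout, and the white background for pixels whose sum is zero are assumptions made for illustration; the caption does not specify them.

```python
import numpy as np

def voxel_to_polarity_image(voxel):
    """Render an event voxel grid as an RGB image.

    voxel: (C, H, W) array (assumed layout), channels are temporal bins.
    Returns an (H, W, 3) uint8 image: red where the channel sum is positive,
    blue where it is negative, white (assumed) where it is zero.
    """
    summed = voxel.sum(axis=0)                                # collapse the channel axis
    img = np.full(summed.shape + (3,), 255, dtype=np.uint8)   # white background
    img[summed > 0] = (255, 0, 0)                             # positive -> red
    img[summed < 0] = (0, 0, 255)                             # negative -> blue
    return img
```

Rendering two adjacent voxels this way and comparing them side by side makes it easy to see the caption's point that corresponding pixels can carry very different events.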