On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events
Jesse Hagenaars, Yilun Wu, Federico Paredes-Vallés, Stein Stroobants, Guido de Croon
TL;DR
The paper tackles on-device, self-supervised learning of monocular depth from event cameras on resource-constrained robots. It introduces a CUDA-accelerated learning pipeline that uses contrast maximization to estimate depth and ego-motion from event streams, trained with the loss $\mathcal{L} = \mathcal{L}_\text{CM} + \lambda \mathcal{L}_\text{geo}$, where $\mathcal{L}_\text{CM}$ is the contrast maximization term and $\lambda$ weights the geometry-consistency term $\mathcal{L}_\text{geo}$. On a small drone, online fine-tuning improves depth accuracy and obstacle avoidance over pre-training alone. On the MVSEC and DSEC benchmarks, the pipeline achieves state-of-the-art performance among self-supervised event-based methods, and its efficiency gains (approximately 100× lower runtime and 2–5× lower memory use) enable 30 Hz operation on embedded hardware. These results demonstrate the practical viability of online self-supervised learning for depth from events, which narrows the reality gap and supports more robust autonomous navigation in real-world environments.
Abstract
Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it forgoes the need for high-frequency ground truth and allows for online learning in the robot's operational environment. However, online, on-board learning raises the major challenge of achieving sufficient computational efficiency for real-time learning, while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization pipeline, making on-device learning of low-latency monocular depth possible. We demonstrate that online learning on board a small drone yields more accurate depth estimates and more successful obstacle avoidance behavior compared to only pre-training. Benchmarking experiments show that the proposed pipeline is not only efficient, but also achieves state-of-the-art depth estimation performance among self-supervised approaches. Our work taps into the unused potential of online, on-device robot learning, promising smaller reality gaps and better performance.
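To make the contrast maximization idea concrete, the following is a minimal NumPy sketch under strong simplifying assumptions, not the paper's CUDA pipeline. It replaces the depth and ego-motion estimates with a single constant 2D flow for all events, uses synthetic events from one moving vertical edge, and picks the flow by grid search rather than by training a network. The function names (`image_of_warped_events`, `contrast`) and all parameter values are hypothetical. The sketch shows only the objective: warping events along the correct motion stacks them onto the same pixels, so the image of warped events becomes sharp and its variance is high.

```python
import numpy as np

def image_of_warped_events(xs, ys, ts, flow, shape, t_ref=0.0):
    """Warp events to t_ref along a constant flow (px/s) and count them per pixel."""
    h, w = shape
    xw = np.rint(xs - (ts - t_ref) * flow[0]).astype(int)
    yw = np.rint(ys - (ts - t_ref) * flow[1]).astype(int)
    # Drop events that are warped outside the image.
    inside = (xw >= 0) & (xw < w) & (yw >= 0) & (yw < h)
    iwe = np.zeros(shape)
    # np.add.at accumulates correctly when several events land on the same pixel.
    np.add.at(iwe, (yw[inside], xw[inside]), 1.0)
    return iwe

def contrast(xs, ys, ts, flow, shape):
    """Contrast objective: variance of the image of warped events."""
    return image_of_warped_events(xs, ys, ts, flow, shape).var()

# Synthetic events (assumption): a vertical edge starting at x = 5 that moves
# right at 20 px/s for 0.5 s. Coordinates are kept sub-pixel for simplicity.
shape = (32, 32)
rng = np.random.default_rng(0)
ts = rng.uniform(0.0, 0.5, size=2000)
ys = rng.integers(0, shape[0], size=2000).astype(float)
xs = 5.0 + 20.0 * ts

# Grid search over horizontal flow candidates; the sharpest image wins.
candidates = [0.0, 10.0, 20.0, 30.0, 40.0]
scores = [contrast(xs, ys, ts, (vx, 0.0), shape) for vx in candidates]
best_vx = candidates[int(np.argmax(scores))]
```

In a learning setting, the grid search would be replaced by gradient-based optimization of network weights, which requires a differentiable accumulation (for example bilinear splatting) instead of the rounding used here.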
