On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events

Jesse Hagenaars, Yilun Wu, Federico Paredes-Vallés, Stein Stroobants, Guido de Croon

TL;DR

The paper tackles on-device, self-supervised monocular depth learning from event cameras on resource-constrained robots. It introduces a CUDA-accelerated learning pipeline that uses contrast maximization to estimate depth and ego-motion from event streams. The pipeline is trained with the overall loss $\mathcal{L} = \mathcal{L}_\text{CM} + \lambda \mathcal{L}_\text{geo}$, where $\mathcal{L}_\text{CM}$ is the contrast maximization term and $\mathcal{L}_\text{geo}$ is a geometry-consistency term weighted by $\lambda$. On a small drone, online fine-tuning improves depth accuracy and obstacle avoidance compared to pre-training alone. On the MVSEC and DSEC benchmarks, the pipeline achieves state-of-the-art performance among self-supervised event-based methods, and its efficiency gains (approximately 100× runtime reduction and 2–5× memory reduction) enable 30 Hz operation on embedded hardware. This demonstrates the practical viability of online SSL for depth-from-events, which reduces reality gaps and makes autonomous navigation in real-world environments more robust.

Abstract

Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it foregoes the need for high-frequency ground truth and allows for online learning in the robot's operational environment. However, online, on-board learning raises the major challenge of achieving sufficient computational efficiency for real-time learning, while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization pipeline, making on-device learning of low-latency monocular depth possible. We demonstrate that online learning on board a small drone yields more accurate depth estimates and more successful obstacle avoidance behavior compared to only pre-training. Benchmarking experiments show that the proposed pipeline is not only efficient, but also achieves state-of-the-art depth estimation performance among self-supervised approaches. Our work taps into the unused potential of online, on-device robot learning, promising smaller reality gaps and better performance.

Paper Structure

This paper contains 31 sections, 5 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Online, on-device learning allows robots to "train in their test environment". We improve the time and memory efficiency of the self-supervised contrast maximization pipeline, such that on-board learning of monocular depth from event camera data becomes possible. When deployed on a small drone, online learning leads to better depth estimates and more successful obstacle avoidance behavior.
  • Figure 2: Top left: events $e$ of different polarities are warped forward and backward by bilinearly sampled (BS) optical flows $\bm{u}$. Events warped outside the image are discarded. Next, events are bilinearly splatted to IWEs (images of warped events) at all reference times $t^*_\text{ref}$. Bottom left: batched processing of events, as in [paredes-valles2023taming], requires zero-padding bins of events to equal length so that they can be warped simultaneously to neighboring reference times. In contrast, our per-event parallel processing in CUDA warps all events independently, which removes the need for padding and lets us warp only those events that are still inside the image space. Right: Runtime and peak increase in memory consumption for different phases of computing the contrast maximization loss on an NVIDIA RTX 4090 and Jetson Orin NX. 10% of each bin is made up of padding, which is not processed by the CUDA implementation. We indicate the range of events per bin for common datasets [delmerico2019are, zhu2018multivehiclea] in black. Naive PyTorch processes all events in a for-loop. Batching events together is substantially faster than this, but parallel processing of all events in CUDA yields even larger speedups while consuming less memory.
  • Figure 3: Overview of the drone (left) and the flight environment (right). System components in blue are for the on-board depth learning pipeline, orange components are for low-level flight control, and green components are for logging only.
  • Figure 4: Qualitative results of disparity predictions on the DSEC disparity benchmark. Images are for visualization only, as disparity estimation is event-based. For easy comparison, the same color map is applied to the disparity values from the stereo- and supervised-learning-based method of Cho et al. [cho2025temporal] and from our monocular, self-supervised learning method.
  • Figure 5: Left: Boxplots of distance between pilot interventions during flight experiments. While using ground truth (GT) depth is best, adding online learning (PT + OL) improves over just pre-training (PT) by 30%. Training from scratch (TFS) does not result in meaningful obstacle avoidance. Right: MAE (mean absolute error) of depth prediction and RSAT (ratio of squared average timestamps, indicates deblurring quality) during online learning in flight. Model checkpoints were saved periodically and evaluated on a test sequence unseen by the model beforehand. 300 learning steps correspond to roughly 100 seconds of training during flight.
  • ...and 7 more figures
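
The warp-and-splat step described in the Figure 2 caption can be illustrated with a minimal, pure-Python sketch. It is not the authors' CUDA implementation. The function names `bilinear_sample` and `warp_and_splat`, the event layout `(x, y, t, p)`, a flow expressed in pixels per unit time, and a single IWE shared by both polarities are all assumptions made for illustration. Each event is handled independently, which is the property that lets a per-event parallel implementation skip padding.

```python
import math


def bilinear_sample(flow, x, y):
    """Bilinearly interpolate a flow field at a sub-pixel location.

    flow[row][col] is a (u, v) tuple; (x, y) is assumed to lie inside the image.
    """
    h, w = len(flow), len(flow[0])
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(2):  # interpolate u and v separately
        top = (1 - fx) * flow[y0][x0][c] + fx * flow[y0][x1][c]
        bottom = (1 - fx) * flow[y1][x0][c] + fx * flow[y1][x1][c]
        out.append((1 - fy) * top + fy * bottom)
    return out[0], out[1]


def warp_and_splat(events, flow, t_ref, height, width):
    """Warp events to t_ref along the flow and splat them into one IWE.

    events: iterable of (x, y, t, p). Polarity p is ignored in this sketch
    (assumption: one IWE for both polarities).
    """
    iwe = [[0.0] * width for _ in range(height)]
    for x, y, t, _p in events:
        u, v = bilinear_sample(flow, x, y)
        dt = t_ref - t  # positive: forward warp, negative: backward warp
        xw, yw = x + dt * u, y + dt * v
        # Events warped outside the image are discarded.
        if not (0 <= xw <= width - 1 and 0 <= yw <= height - 1):
            continue
        x0, y0 = int(math.floor(xw)), int(math.floor(yw))
        fx, fy = xw - x0, yw - y0
        # Bilinear splat: distribute a unit weight over the 4 nearest pixels.
        for dy, wy in ((0, 1 - fy), (1, fy)):
            for dx, wx in ((0, 1 - fx), (1, fx)):
                xi, yi = x0 + dx, y0 + dy
                if xi < width and yi < height and wx * wy > 0:
                    iwe[yi][xi] += wx * wy
    return iwe


# Toy usage: constant flow of 0.5 px per unit time to the right on a 3x4 image.
flow = [[(0.5, 0.0)] * 4 for _ in range(3)]
events = [(1.0, 1.0, 0.0, 1), (3.0, 1.0, 0.0, -1)]  # second event leaves the image
iwe = warp_and_splat(events, flow, t_ref=1.0, height=3, width=4)
print(iwe[1])  # [0.0, 0.5, 0.5, 0.0]
```

In a GPU version of this idea, the outer loop over events would map to one thread per event, with the accumulation into the IWE done atomically. That mapping is a design assumption here and is not taken from the paper's code.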