EvtSlowTV -- A Large and Diverse Dataset for Event-Based Depth Estimation
Sadiq Layi Macaulay, Nimet Kaygusuz, Simon Hadfield
TL;DR
EvtSlowTV tackles the paucity of large-scale datasets for event-based depth estimation by introducing a real-world, unconstrained dataset derived from YouTube footage, containing over 13B events across diverse environments. It enables self-supervised depth learning by preserving the asynchronous nature of event streams and employing a contrast maximization objective together with a teacher-student training strategy. The approach uses adaptive frame sampling to generate high-fidelity events from SlowTV videos and a 5-bin event volume fed into a skip-connected encoder-decoder to estimate depth and pose. Evaluations demonstrate improved generalization and competitive performance against state-of-the-art methods, highlighting the practical impact of large-scale, diverse event data for robust depth perception in challenging lighting and motion conditions.
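As a rough illustration of the 5-bin event volume mentioned above, the sketch below accumulates a stream of events into a `(5, H, W)` tensor. It assumes the widely used scheme in which each event's polarity is split between its two nearest temporal bins by linear interpolation. The summary does not say how EvtSlowTV builds its volume, so the function name `events_to_volume`, its signature, and the interpolation scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def events_to_volume(x, y, t, p, height, width, num_bins=5):
    """Accumulate events into a (num_bins, height, width) volume.

    Hypothetical helper: x, y are integer pixel coordinates, t holds
    timestamps, and p holds polarities (values > 0 count as ON events).
    """
    volume = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return volume

    # Rescale timestamps to the continuous bin range [0, num_bins - 1].
    t = t.astype(np.float64)
    span = t.max() - t.min()
    if span > 0:
        t_norm = (t - t.min()) / span * (num_bins - 1)
    else:
        t_norm = np.zeros_like(t)

    # Each event contributes to the bin below and the bin above it.
    lower = np.floor(t_norm).astype(np.int64)
    upper = np.minimum(lower + 1, num_bins - 1)
    w_upper = t_norm - lower
    w_lower = 1.0 - w_upper

    # Map polarities to +1 / -1 (assumed convention).
    pol = np.where(p > 0, 1.0, -1.0)

    # np.add.at accumulates correctly when several events hit one cell.
    np.add.at(volume, (lower, y, x), (pol * w_lower).astype(np.float32))
    np.add.at(volume, (upper, y, x), (pol * w_upper).astype(np.float32))
    return volume
```

Because timestamps stay continuous until the final weighting step, this representation keeps some of the sub-frame timing of the event stream instead of collapsing it into a single image, which is the usual motivation for feeding such a volume to an encoder-decoder network.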
Abstract
Event cameras, with their high dynamic range (HDR) and low latency, offer a promising alternative for robust depth estimation in challenging environments. However, many event-based depth estimation approaches are constrained by small-scale annotated datasets, limiting their generalisability to real-world scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event camera dataset curated from publicly available YouTube footage, which contains more than 13B events across various environmental conditions and motions, including seasonal hiking, flying, scenic driving, and underwater exploration. EvtSlowTV is an order of magnitude larger than existing event datasets, providing an unconstrained, naturalistic setting for event-based depth learning. We show that EvtSlowTV is well suited to a self-supervised learning framework that capitalises on the HDR potential of raw event streams. We further demonstrate that training with EvtSlowTV enhances the model's ability to generalise to complex scenes and motions. Our approach removes the need for frame-based annotations and preserves the asynchronous nature of event data.
