Event Camera Data Dense Pre-training
Yan Yang, Liyuan Pan, Liu Liu
TL;DR
This work addresses pre-training neural networks for dense prediction tasks using event camera data. It observes that RGB-based pre-training transfers poorly because event images are spatially sparse. It introduces a self-supervised framework, built on a teacher-student scheme with exponential moving average (EMA) updates, that enforces patch-level, context-level, and image-level similarities. Contexts are mined on the fly with K-means clustering so that they are discriminative enough to learn from. A synthetic dataset, E-TartanAir, provides diverse scenes and motions for pre-training. On semantic segmentation, optical flow, and depth estimation, the method achieves state-of-the-art results, which demonstrates the value of context-aware, event-only pre-training and suggests a path toward future event-based foundation models.
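The teacher-student EMA mentioned above can be illustrated with a minimal sketch. It assumes the common convention in which the teacher's weights track an exponential moving average of the student's weights. The function name `ema_update`, the flat list-of-floats weight representation, and the momentum value are hypothetical choices for illustration and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    `momentum` is an assumed hyperparameter; values close to 1 make the
    teacher change slowly.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights (hypothetical): the teacher drifts toward the student.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 0.0]
new_teacher = ema_update(teacher_weights, student_weights, momentum=0.9)
```

With a momentum of 0.9, each teacher weight moves one tenth of the way toward the corresponding student weight per update, so the teacher provides slowly varying targets.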
Abstract
This paper introduces a self-supervised learning framework for pre-training neural networks for dense prediction tasks using event camera data. Our approach uses only event data for training. Directly transferring dense RGB pre-training techniques to event camera data yields subpar performance. We attribute this to the spatial sparsity of an event image (converted from event data), in which many pixels contain no information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features. To train our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks demonstrates the superiority of our method over state-of-the-art approaches.
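The grouping step described above (patch features clustered into contexts, then compared context to context) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation. It assumes that clusters are computed on the teacher's patch features, that a context is the mean of its member patch features, and that the loss is one minus cosine similarity. All function names and the toy feature values are hypothetical.

```python
import math
import random


def kmeans(features, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per feature vector."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])))
            for f in features
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def pool_contexts(features, assign, k):
    """Average the patch features in each cluster into one context vector."""
    dim = len(features[0])
    contexts = []
    for c in range(k):
        members = [f for f, a in zip(features, assign) if a == c]
        if members:
            contexts.append([sum(col) / len(members) for col in zip(*members)])
        else:
            contexts.append([0.0] * dim)
    return contexts


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_loss(student_feats, teacher_feats, k):
    """Mean (1 - cosine) between student and teacher contexts.

    Assumption: the clustering comes from the teacher features and is
    reused to pool the student features.
    """
    assign = kmeans(teacher_feats, k)
    student_ctx = pool_contexts(student_feats, assign, k)
    teacher_ctx = pool_contexts(teacher_feats, assign, k)
    return sum(1.0 - cosine(s, t) for s, t in zip(student_ctx, teacher_ctx)) / k


# Toy patch features (hypothetical): two visually distinct groups of patches.
teacher_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments = kmeans(teacher_feats, k=2)
loss_same = context_loss(teacher_feats, teacher_feats, k=2)
swapped_feats = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
loss_swapped = context_loss(swapped_feats, teacher_feats, k=2)
```

Pooling many patches into one context vector is one plausible way to make a comparison less sensitive to individual empty patches, which is the sparsity concern the abstract raises. In this toy the loss is near zero when the student matches the teacher and grows when the student's patch features are swapped between the two groups.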
