EvRepSL: Event-Stream Representation via Self-Supervised Learning for Event-Based Vision

Qiang Qu, Xiaoming Chen, Yuk Ying Chung, Yiran Shen

TL;DR

This paper introduces EvRepSL, a high-quality, self-supervised event-stream representation for event-based vision. It starts with EvRep, a three-channel spatial–temporal representation, and derives a theoretical link between asynchronous events and synchronous frames to enable refinement via RepGen, which outputs EvRepSL without task-specific retraining. Through extensive experiments on classification and optical flow, EvRepSL substantially outperforms existing representations while remaining agnostic to the camera type and downstream task. The approach delivers practical gains in accuracy and efficiency, establishing EvRepSL as a versatile foundation for future event-based vision systems.

Abstract

Event-stream representation is the first step for many computer vision tasks using event cameras. It converts asynchronous event-streams into a formatted structure so that conventional machine learning models can be applied easily. However, most state-of-the-art event-stream representations are manually designed, and their quality cannot be guaranteed due to the noisy nature of event-streams. In this paper, we introduce a data-driven approach aimed at enhancing the quality of event-stream representations. Our approach begins with a new event-stream representation based on spatial-temporal statistics, denoted EvRep. We then theoretically derive the intrinsic relationship between asynchronous event-streams and synchronous video frames. Building on this relationship, we train a representation generator, RepGen, in a self-supervised manner, with EvRep as its input. Finally, event-streams are converted to high-quality representations, termed EvRepSL, by passing them through the learned RepGen (without the need for fine-tuning or retraining). Our methodology is validated through extensive evaluations on a variety of mainstream event-based classification and optical flow datasets (captured with various types of event cameras). The experimental results highlight not only our approach's superior performance over existing event-stream representations but also its versatility, as it is agnostic to different event cameras and tasks.
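
To make the idea of a representation "based on spatial-temporal statistics" concrete, the following is a minimal sketch of how a three-channel, per-pixel representation could be built from raw events. The exact statistics used by EvRep are not given in this summary, so the choices here are assumptions: an event count for the spatial channel, a signed polarity sum for the polarity channel, and the standard deviation of inter-event intervals for the temporal channel. The function name `evrep_sketch` and its arguments are hypothetical.

```python
import numpy as np

def evrep_sketch(x, y, t, p, height, width):
    """Hypothetical 3-channel (count, polarity, temporal) event representation.

    Assumed statistics, not taken from the paper:
      channel 0: number of events per pixel
      channel 1: signed sum of polarities per pixel (p in {-1, +1})
      channel 2: std of time intervals between consecutive events per pixel
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)

    idx = y * width + x          # flattened pixel index of each event
    n = height * width

    count = np.bincount(idx, minlength=n).astype(np.float64)
    polarity = np.bincount(idx, weights=p, minlength=n)

    # Sort by pixel first, then by timestamp, so that consecutive
    # entries with the same pixel index are consecutive events in time.
    order = np.lexsort((t, idx))
    idx_s, t_s = idx[order], t[order]
    same = idx_s[1:] == idx_s[:-1]
    dt = (t_s[1:] - t_s[:-1])[same]   # inter-event intervals
    pix = idx_s[1:][same]             # pixel each interval belongs to

    k = np.bincount(pix, minlength=n).astype(np.float64)
    s1 = np.bincount(pix, weights=dt, minlength=n)
    s2 = np.bincount(pix, weights=dt * dt, minlength=n)
    mean = np.divide(s1, k, out=np.zeros(n), where=k > 0)
    var = np.divide(s2, k, out=np.zeros(n), where=k > 0) - mean ** 2
    temporal = np.sqrt(np.clip(var, 0.0, None))  # clip guards rounding error

    return np.stack([count, polarity, temporal]).reshape(3, height, width)
```

The output has shape `(3, height, width)`, so it can be fed to a conventional convolutional model in the same way as an image. Pixels with fewer than two events have no interval and receive a temporal value of zero in this sketch.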

Paper Structure

This paper contains 25 sections, 23 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of the Proposed Methodology. The figure illustrates the self-supervised learning framework for generating event-stream representations. Left: Raw event streams are processed into a 3-channel EvRep representation capturing spatial, polarity, and temporal information. These representations are then used to train the representation generator, RepGen, on a hybrid, unlabeled dataset. Middle: Through a self-supervised learning process, RepGen learns an enhanced event representation without requiring labeled data. Right: Once trained, RepGen is device- and task-agnostic, so it can generate high-quality representations (EvRepSL) for other event-only datasets without fine-tuning or retraining. EvRepSL can therefore be used in downstream event-based tasks in the same way as a traditional representation.
  • Figure 2: Demonstration of varying event-stream patterns in the temporal domain, as reflected by different values of the proposed temporal channel $\mathcal{E}_T$, which captures the distinctive timing dynamics of the events.
  • Figure 3: Network architecture of the proposed RepGen for self-supervised representation learning. It consists of a shared downsampling encoder $h^{enc}$ and two separate upsampling decoders ($h^{dec}_{\mathcal{E}_I}$ and $h^{dec}_{\theta}$) designed to estimate $\mathcal{E}_I$ and $\theta$ respectively, followed by a computing module that predicts the next frame $f_1$. The Frame-Event Relation block is used solely to guide the self-supervised learning process and is not included in downstream applications. In other words, once RepGen has been trained, it can be applied directly to any event-only dataset, without requiring frame data.
  • Figure 4: Visualization of $\mathcal{E}_{I}$ (first row) and $\mathcal{E}_{I}^{rfd}$ (second row).
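
The Figure 3 caption describes a computing module that predicts the next frame $f_1$ from the estimated $\mathcal{E}_I$ and $\theta$, but this summary does not state the formula. The sketch below assumes the commonly used event-camera model in which the per-pixel change in log intensity is approximately the contrast threshold times the net polarity, that is $\log f_1 - \log f_0 \approx \theta \cdot \mathcal{E}_I$. The function names, the `eps` stabilizer, and the L1 loss are hypothetical choices for illustration and may differ from the paper's derived relation.

```python
import numpy as np

def predict_next_frame(f0, e_i, theta, eps=1e-6):
    """Predict f1 from f0 under the assumed log-intensity event model.

    f0:    previous frame, non-negative intensities
    e_i:   per-pixel net polarity (same shape as f0)
    theta: contrast threshold, scalar or per-pixel array
    eps:   small constant so that log(0) is never evaluated
    """
    f0 = np.asarray(f0, dtype=np.float64)
    e_i = np.asarray(e_i, dtype=np.float64)
    log_f1 = np.log(f0 + eps) + theta * e_i
    return np.exp(log_f1) - eps

def photometric_loss(f1_pred, f1_true):
    """Mean absolute error between predicted and observed next frame."""
    diff = np.asarray(f1_pred, dtype=np.float64) - np.asarray(f1_true, dtype=np.float64)
    return float(np.mean(np.abs(diff)))
```

Under this assumption, the observed next frame acts as the training signal, so no task labels are needed, and frames are only required during training. That is consistent with the caption's note that the Frame-Event Relation block is dropped in downstream use.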