Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space

Kaustav Chanda, Aayush Atul Verma, Arpitsinh Vaghela, Yezhou Yang, Bharatesh Chakravarthi

TL;DR

The paper tackles the sim-to-real gap in event-camera data by introducing Event Quality Score (EQS), a differentiable metric that directly compares two raw event streams using latent features from a pre-trained recurrent vision transformer. EQS operates on event tensors and leverages activations from the first convolutional blocks to quantify similarity via latent-space distances, providing a numeric measure of realism for simulated streams. Empirical results on the DSEC driving dataset show that higher EQS aligns with better generalization of models trained on simulated data to real-world data, with ESIM producing the closest match to real noise patterns and the smallest sim-to-real gap. This metric offers a principled, task-agnostic way to optimize simulators and could be incorporated as a loss to produce more realistic event streams for downstream vision tasks.

Abstract

Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at https://github.com/eventbasedvision/EQS.

Paper Structure

This paper contains 11 sections, 9 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Schematic representation of a recurrent vision transformer (RVT) model trained on simulated event data and evaluated on real event streams.
  • Figure 2: Simulated events from the V2E [hu2021v2evideoframesrealistic], ESIM [Rebecq2018], and PIX2NVS [pix2nvs] event camera simulators, alongside real event camera data, visualized as frames in which red pixels indicate negative polarity and blue pixels indicate positive polarity.
  • Figure 3: (a) Dataset with frames and their corresponding synchronized event streams. Simulated events are generated using frames as input, and both sets of events are composed into tensors as discussed in the section on event processing. (b) For each scale in the RVT model, cosine distances are calculated between the feature maps from corresponding convolution layers and aggregated to obtain the final score.
  • Figure 4: Detection output samples for RVT-small model on real event streams after training on simulated datasets generated using the ESIM, V2E, and PIX2NVS simulators.
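
The aggregation described in the caption of Figure 3(b) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each scale's feature map is given as a list of per-location channel vectors, that cosine distances are averaged over locations within a scale, and that scales are then averaged with equal weight. The function names (`cosine_distance`, `scale_distance`, `aggregate_distance`) are hypothetical, and the mapping from this aggregated distance to the final EQS value is not specified in this summary, so the sketch stops at the distance.

```python
import math
from typing import Sequence

Vector = Sequence[float]
FeatureMap = Sequence[Vector]  # assumption: one channel vector per spatial location


def cosine_distance(u: Vector, v: Vector, eps: float = 1e-8) -> float:
    """Return 1 - cosine similarity of two channel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero activations.
    return 1.0 - dot / max(norm_u * norm_v, eps)


def scale_distance(real_map: FeatureMap, sim_map: FeatureMap) -> float:
    """Average the cosine distance over all spatial locations of one scale."""
    if len(real_map) != len(sim_map) or len(real_map) == 0:
        raise ValueError("feature maps must be non-empty and the same size")
    dists = [cosine_distance(r, s) for r, s in zip(real_map, sim_map)]
    return sum(dists) / len(dists)


def aggregate_distance(
    real_feats: Sequence[FeatureMap], sim_feats: Sequence[FeatureMap]
) -> float:
    """Average per-scale distances (equal weighting is an assumption)."""
    if len(real_feats) != len(sim_feats) or len(real_feats) == 0:
        raise ValueError("need the same non-zero number of scales")
    per_scale = [scale_distance(r, s) for r, s in zip(real_feats, sim_feats)]
    return sum(per_scale) / len(per_scale)


# Toy activations: two scales, two spatial locations, two channels each.
real = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [2.0, 0.0]]]
sim = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, -1.0], [0.0, 3.0]]]
distance = aggregate_distance(real, sim)
```

In this toy input the first scale matches exactly and the second is orthogonal at every location, so the aggregated distance falls halfway between the two extremes. A lower distance indicates that the simulated activations lie closer to the real ones in latent space.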