V2CE: Video to Continuous Events Simulator

Zhongyang Zhang, Shuyang Cui, Kaidong Chai, Haowen Yu, Subhasis Dasgupta, Upal Mahbub, Tauhidur Rahman

TL;DR

V2CE tackles the challenge of generating continuous DVS-like event streams from ordinary videos by introducing a two-stage pipeline: Stage 1 converts video into motion-aware event voxels using a 3D UNet with a comprehensive loss suite, and Stage 2 recovers precise, continuous event timestamps via Local Dynamics-Aware Timestamp Inference (LDATI). Validated on MVSEC, the approach shows higher voxel fidelity than the baselines, and its sampling strategy preserves temporal dynamics while yielding near-ground-truth event counts with low timestamp error. The work also introduces new metrics tailored to DVS event characteristics, enabling rigorous evaluation of both voxel-level predictions and continuous-event streams. Collectively, V2CE achieves state-of-the-art performance and real-time throughput, providing a practical path for high-fidelity DVS data generation and pretraining for event-based tasks.

Abstract

Dynamic Vision Sensor (DVS)-based solutions have recently garnered significant interest across various computer vision tasks, offering notable benefits in terms of dynamic range, temporal resolution, and inference speed. However, as a relatively nascent vision sensor compared to Active Pixel Sensor (APS) devices such as RGB cameras, DVS suffers from a dearth of labeled datasets. Prior efforts to convert APS data into events often grapple with issues such as a considerable domain shift from real events, the absence of quantified validation, and layering problems within the time axis. In this paper, we present a novel method for video-to-events stream conversion from multiple perspectives, considering the specific characteristics of DVS. A series of carefully designed losses helps enhance the quality of generated event voxels significantly. We also propose a novel local dynamics-aware timestamp inference strategy to accurately recover event timestamps from event voxels in a continuous fashion and eliminate the temporal layering problem. Results from rigorous validation through quantified metrics at all stages of the pipeline establish our method unquestionably as the current state-of-the-art (SOTA).
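
To make the temporal layering problem concrete, the following minimal sketch bins a toy event stream into a voxel grid and then recovers timestamps in two naive ways. It is not the paper's 3D UNet or LDATI: the function names `events_to_voxel` and `voxel_to_events`, the toy events, and the uniform in-bin sampling are all assumptions chosen for illustration. Placing every recovered event at its bin center collapses the stream onto a few discrete time layers, whereas sampling inside each bin gives continuous timestamps but ignores local dynamics, which is the gap a dynamics-aware strategy is meant to close.

```python
import random
from collections import defaultdict


def events_to_voxel(events, t_start, t_end, num_bins):
    """Accumulate signed event counts into (bin, y, x) cells.

    Hypothetical helper: a plain histogram, not the paper's voxel encoding.
    Opposite polarities landing in the same cell cancel out.
    """
    voxel = defaultdict(int)
    bin_width = (t_end - t_start) / num_bins
    for x, y, t, p in events:
        b = min(int((t - t_start) / bin_width), num_bins - 1)
        voxel[(b, y, x)] += 1 if p > 0 else -1
    return voxel


def voxel_to_events(voxel, t_start, t_end, num_bins, continuous, seed=0):
    """Turn voxel counts back into events (hypothetical, NOT LDATI).

    continuous=False puts every event at its bin center (layered output).
    continuous=True draws a uniform random offset inside the bin.
    """
    rng = random.Random(seed)
    bin_width = (t_end - t_start) / num_bins
    out = []
    for (b, y, x), count in voxel.items():
        polarity = 1 if count > 0 else -1
        for _ in range(abs(count)):
            offset = rng.random() if continuous else 0.5
            out.append((x, y, t_start + (b + offset) * bin_width, polarity))
    return sorted(out, key=lambda e: e[2])


# Toy stream of (x, y, t_seconds, polarity) tuples; values are made up.
events = [
    (0, 0, 0.001, 1), (0, 0, 0.002, 1), (1, 0, 0.004, -1),
    (1, 1, 0.012, 1), (0, 1, 0.018, -1), (0, 1, 0.019, -1),
]
voxel = events_to_voxel(events, 0.0, 0.02, num_bins=2)
layered = voxel_to_events(voxel, 0.0, 0.02, 2, continuous=False)
smooth = voxel_to_events(voxel, 0.0, 0.02, 2, continuous=True)

print("distinct timestamps, bin-center recovery:", len({e[2] for e in layered}))
print("distinct timestamps, in-bin sampling:", len({e[2] for e in smooth}))
```

Both recoveries preserve the event count of the toy stream; only the second spreads events across the time axis instead of stacking them on two layers.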

Paper Structure

This paper contains 11 sections, 9 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: K-nearest neighbor graph comparison from events in $(x,y,t)$ space. The v2e event stream is generated with input video upsampled to 3000FPS.
  • Figure 2: Proposed Motion-Aware Event Voxel Prediction Pipeline and Hybrid Loss Structure: Our method consists of two main stages. The Backbone 3D UNet encodes input frame pair sequences and generates event frames. The Event Sampling Module, subdivided into chain decoupling and distribution transformation modules, calculates event counts and in-voxel time, then redistributes events in Type2 voxels. The loss functions for training, displayed on the right, include STP, TP, ADV, BC, and EF Losses, which are elaborated in Section \ref{sec:stage1}.
  • Figure 3: Visualization for two stages in LDATI.
  • Figure 4: Event frame comparison between V2CE and baseline methods. Event frames are clipped to the maximum value of their corresponding ground truth event frames and then normalized.
  • Figure 5: Zoom-in comparison on event frame details.
  • ...and 2 more figures