V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou, Jinxiu Liang, Minggui Teng, Yi Wang, Boxin Shi

TL;DR

This work tackles the data scarcity in event-based vision by introducing Video-to-Voxel (V2V), a principled method that directly converts conventional videos into discrete voxel representations, bypassing costly event-stream generation. By discarding intra-bin timing and using on-the-fly randomization of camera parameters, V2V achieves up to ~150x storage reduction and enables training on large-scale datasets such as WebVid, improving robustness and diversity. The authors validate V2V by training and evaluating state-of-the-art video reconstruction (E2VID) and optical flow (EvFlow) models, demonstrating comparable or superior performance to traditional event-based pipelines and highlighting improvements from per-iteration data variation. The approach significantly lowers data-collection barriers, accelerates scaling of event-based training, and broadens the applicability of event-based methods to real-world, high-variation datasets.

Abstract

Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours--an order of magnitude larger than existing event datasets, yielding substantial improvements.
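The direct frame-to-voxel idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified event model in which each voxel bin holds the net log-intensity change over its time span divided by a contrast threshold, and the function name `video_to_voxel`, the `eps` offset, and the threshold sampling range are all hypothetical choices made for illustration.

```python
import numpy as np

def video_to_voxel(frames, num_bins, threshold, eps=1e-3):
    """Convert a (T, H, W) clip with values in [0, 1] into a (num_bins, H, W) voxel grid.

    Assumption (illustrative only): each bin stores the net signed event count,
    i.e. the log-intensity change across the bin divided by the contrast
    threshold and truncated toward zero. Timing inside a bin is discarded.
    """
    frames = np.asarray(frames, dtype=np.float64)
    steps = frames.shape[0] - 1
    if steps < num_bins or steps % num_bins != 0:
        raise ValueError("T - 1 must be a positive multiple of num_bins")
    k = steps // num_bins                     # frame intervals per bin
    log_frames = np.log(frames + eps)         # eps avoids log(0)
    keyframes = log_frames[::k]               # frames at the bin boundaries
    net_change = np.diff(keyframes, axis=0)   # (num_bins, H, W)
    return np.trunc(net_change / threshold)   # net event count per pixel and bin

# On-the-fly randomization: draw a new threshold each training iteration
# (the sampling range here is a hypothetical placeholder).
rng = np.random.default_rng(0)
clip = rng.random((9, 4, 4))
voxel = video_to_voxel(clip, num_bins=4, threshold=rng.uniform(0.1, 0.5))
```

Because only the video is stored and the voxel grid is recomputed for every sample, the same clip can yield different voxels across iterations simply by redrawing the threshold, which is the property the abstract refers to as on-the-fly parameter randomization.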

Paper Structure

This paper contains 21 sections, 9 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Event data's inherent ambiguity: multiple valid image sequences $I(t_0)$ can produce identical event streams $\mathcal{E}$ under different initial conditions and camera parameters $M$. This ambiguity makes diverse training data critical for establishing robust priors in reconstruction tasks.
  • Figure 1: Comparison of dataset characteristics. "Seqs" for event datasets represents total frames divided by 40, providing a normalized comparison metric.
  • Figure 2: While synthetic (top) and real (bottom) events exhibit notable differences in interpolated voxels (left) due to microsecond-level temporal disparities, they demonstrate similarity in discrete voxel representations (right)—justifying our direct video-to-voxel conversion approach.
  • Figure 3: The V2V module efficiently converts input videos to output voxels with random parameters selected at train time (left). We use the V2V module and a lightweight optical flow estimator RAFT to train video reconstruction (middle) and optical flow estimation models (right).
  • Figure 4: Effectiveness of the proposed parameter randomization across dataset sizes.
  • ...and 12 more figures