V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou; Jinxiu Liang; Minggui Teng; Yi Wang; Boxin Shi

TL;DR

This work tackles data scarcity in event-based vision by introducing Video-to-Voxel (V2V), a method that converts conventional videos directly into discrete voxel grid representations and skips the costly intermediate step of generating an event stream. Because V2V discards intra-bin timing and randomizes camera parameters on the fly, it cuts storage requirements by roughly 150x and makes training on large-scale video datasets such as WebVid practical, which improves robustness and data diversity. The authors validate V2V by training and evaluating video reconstruction (E2VID) and optical flow (EvFlow) models, reporting performance comparable or superior to that of traditional event-stream pipelines and further gains from varying the data at every iteration. The approach lowers the barrier to data collection, speeds up the scaling of event-based training, and extends event-based methods to real-world, high-variation datasets.

Abstract

Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines, together with the scarcity of real data, prevent event-based training datasets from scaling up, which limits the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150-fold reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude larger than existing event datasets, and obtain substantial improvements.
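
The idea of going from frames to a voxel grid without materializing events can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard idealized event model, in which a pixel fires one event each time its log intensity changes by a contrast threshold, so the net signed event count in a temporal bin is approximately the log-intensity change across that bin divided by the threshold. The function names `frames_to_voxel` and `random_threshold`, the `eps` offset, and the threshold range are hypothetical choices made for illustration only.

```python
import numpy as np


def frames_to_voxel(frames, num_bins, threshold, eps=1e-3):
    """Approximate a voxel grid directly from video frames (illustrative only).

    frames:    array of shape (T, H, W), intensities in [0, 1], T >= num_bins + 1
    num_bins:  number of temporal bins in the output voxel grid
    threshold: contrast threshold of the simulated sensor (assumed scalar)
    eps:       small offset that keeps log() finite for black pixels (assumption)

    Returns an array of shape (num_bins, H, W) holding the approximate net
    signed event count per pixel per bin. Timing inside a bin is discarded.
    """
    frames = np.asarray(frames, dtype=np.float64)
    log_frames = np.log(frames + eps)
    # Choose num_bins + 1 frames as bin boundaries, evenly spaced in time.
    idx = np.linspace(0, len(frames) - 1, num_bins + 1).round().astype(int)
    # Net log-intensity change across each bin, in units of the threshold.
    return np.diff(log_frames[idx], axis=0) / threshold


def random_threshold(rng, low=0.1, high=0.5):
    """Draw a contrast threshold per call; the range is an assumed example."""
    return rng.uniform(low, high)


# Usage: three 2x3 frames whose brightness doubles at each step.
frames = np.stack([np.full((2, 3), v) for v in (0.1, 0.2, 0.4)])
rng = np.random.default_rng(0)
voxel = frames_to_voxel(frames, num_bins=2, threshold=random_threshold(rng))
```

Because only the source frames need to be stored and the threshold is drawn at load time, each training iteration can see a differently parameterized voxel grid of the same clip, which is the kind of per-iteration variation the abstract refers to.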
