LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction

Kanghao Chen; Hangyu Li; JiaZhou Zhou; Zeyu Wang; Lin Wang

TL;DR

LaSe-E2V tackles the ill-posed problem of reconstructing videos from event camera streams by injecting semantic guidance from natural language descriptions. It couples a text-conditioned diffusion model with an Event-guided Spatio-temporal Attention (ESA) module, an event-aware mask loss, and an event-aware noise initialization to enforce semantic and spatiotemporal coherence between the event data and the generated video. The framework reuses existing E2V datasets by generating textual descriptions for them, and it outperforms prior methods across multiple real-world and synthetic benchmarks, notably under fast motion and low-light conditions. By using language as a rich semantic prior, the approach enables more accurate, visually coherent, and controllable video reconstruction from sparse event data.

Abstract

Event cameras offer low latency, high temporal resolution, and high dynamic range (HDR) compared to standard cameras. Because their imaging paradigm differs fundamentally from that of standard cameras, a dominant line of research focuses on event-to-video (E2V) reconstruction to bridge event-based and standard computer vision. However, this task remains challenging because it is inherently ill-posed: event cameras detect only edge and motion information locally. Consequently, reconstructed videos are often plagued by artifacts and regional blur, primarily caused by the ambiguous semantics of event data. In this paper, we find that language naturally conveys abundant semantic information, making it well suited to ensuring semantic consistency in E2V reconstruction. Accordingly, we propose a novel framework, called LaSe-E2V, that achieves semantic-aware, high-quality E2V reconstruction from a language-guided perspective, built on text-conditional diffusion models. However, because of the inherent diversity and randomness of diffusion models, it is difficult to apply them directly while maintaining spatial and temporal consistency in E2V reconstruction. Thus, we first propose an Event-guided Spatio-temporal Attention (ESA) module to effectively condition the denoising pipeline on the event data. We then introduce an event-aware mask loss to ensure temporal coherence and a noise initialization strategy to enhance spatial consistency. Given the absence of event-text-video paired data, we aggregate existing E2V datasets and generate textual descriptions with tagging models for training and evaluation. Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method.
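
To make the ill-posedness concrete, here is a minimal sketch (not taken from the paper) of the commonly used idealized event model, in which a pixel fires an event when its log intensity changes by at least a contrast threshold. The function name `simulate_events`, the threshold value, and the toy frames are illustrative assumptions. The example shows that a moving uniform square triggers events only at its leading and trailing edges, so the interior carries no event information at all.

```python
import numpy as np

def simulate_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Return a per-pixel polarity map (+1, -1, 0) under an idealized event model.

    Assumption: a single event per pixel between two frames; real sensors
    emit asynchronous, noisy event streams.
    """
    # Event cameras respond to changes in log intensity, not absolute intensity.
    delta = np.log(next_frame + eps) - np.log(prev_frame + eps)
    polarity = np.zeros_like(delta, dtype=np.int8)
    polarity[delta >= threshold] = 1    # brightness increased
    polarity[delta <= -threshold] = -1  # brightness decreased
    return polarity

# A bright square on a dark background that shifts one pixel to the right.
frame0 = np.full((8, 8), 0.1)
frame0[2:6, 2:5] = 0.9
frame1 = np.full((8, 8), 0.1)
frame1[2:6, 3:6] = 0.9

events = simulate_events(frame0, frame1)
print(events)
```

Only the two vertical edges of the square produce events; the unchanged interior and background stay silent. Any reconstruction method must therefore infer the appearance of those regions from a prior, which is the role the language description plays in the framework described above.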

Paper Structure (13 sections, 7 equations, 8 figures, 3 tables)

Introduction
Related Works
The Proposed LaSe-E2V Framework
Overall Pipeline
Event-guided Spatio-temporal Attention (ESA)
Event-aware Mask Loss
Event-aware Noise Initialization
Experiments
Datasets and Implementation Details
Comparison with State-of-the-Art Methods
Discussion
Ablation Study
Discussion and Conclusion

Figures (8)

Figure 1: Comparison of the E2V pipeline between HyperE2VID [ercan2024hypere2vid] and our LaSe-E2V. The baseline method relies solely on event data, leading to ambiguity in local structures. In contrast, our approach integrates language descriptions to enrich the semantic information and keep the video coherent with the event stream.
Figure 2: An overview of our proposed LaSe-E2V framework.
Figure 3: Qualitative results under fast-motion conditions on the HS-ERGB dataset [tulyakov2021time].
Figure 4: Qualitative comparisons on four sampled sequences from the test datasets. While previous approaches suffer from low contrast, blur, and extensive artifacts, LaSe-E2V obtains clear edges with high contrast and preserves the semantic details of the objects.
Figure 5: Qualitative results under low-light conditions on the MVSEC dataset [zhu2018multivehicle] (outdoor_night2). LaSe-E2V better preserves the HDR characteristic of event cameras, yielding higher contrast.
...and 3 more figures
