IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction

Dmitrii Torbunov, Onur Okuducu, Yi Huang, Odera Dim, Rebecca Coles, Yonggang Cui, Yihui Ren

TL;DR

The paper tackles power-efficient RGB video capture by proposing a hybrid scheme that records sparse RGB keyframes alongside continuous event streams and reconstructs full RGB video offline. It compares an autoregressive baseline based on HyperE2VID with a diffusion-based approach that injects event information into a pretrained LTX video model using LoRA adapters and an event encoder. The diffusion-based, event-conditioned method achieves about 33% better perceptual quality (0.283 vs 0.422 LPIPS, lower is better), generalizes well across datasets, and extrapolates temporally to sequences of 32–128 frames. The results suggest that pretrained diffusion models with event conditioning are a practical route to long-horizon, power-aware video generation for surveillance, robotics, and wearable vision applications.

Abstract

Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline, reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32–128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
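The 33% figure follows from the two LPIPS scores quoted in the abstract, since LPIPS is a lower-is-better metric. A minimal sketch of the arithmetic is below; the function and variable names are illustrative and do not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of a lower-is-better metric, relative to the baseline."""
    return (baseline - improved) / baseline


# LPIPS scores as reported in the abstract (lower is better).
lpips_autoregressive = 0.422  # HyperE2VID-based baseline
lpips_diffusion = 0.283       # LTX-based diffusion approach

gain = relative_improvement(lpips_autoregressive, lpips_diffusion)
print(f"LPIPS reduction: {gain:.1%}")  # prints "LPIPS reduction: 32.9%"
```

The reduction of 0.139 LPIPS relative to the baseline's 0.422 rounds to the 33% stated above.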

Paper Structure

This paper contains 55 sections, 7 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Task: RGB video reconstruction from a keyframe and sparse event camera data. Given only the first RGB frame (leftmost, shared between rows) and event camera stream (top row showing motion-encoded brightness changes), the goal is to generate a full RGB video sequence matching the ground truth frames (bottom row).
  • Figure 2: LTX-Events architecture. The video generator is trained by injecting encoded event information into the Transformer stack of the denoising decoder of the pretrained LTX model.
  • Figure 3: Qualitative comparison on BS-ERGB.
  • Figure 4: Qualitative comparison on BS-ERGB (32 frames) showing a person preparing to jump. Column labels indicate frame numbers.
  • Figure 5: Qualitative comparison on HS-ERGB close (32 frames) showing a water balloon falling and bursting. Column labels indicate frame numbers.
  • ...and 10 more figures
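
The low-rank adaptation mentioned in the abstract and in the Figure 2 caption can be illustrated with a generic LoRA layer. The sketch below is not the authors' implementation: the layer sizes, rank, scaling factor, and all names are assumptions chosen for illustration, and it uses NumPy rather than the actual LTX model. It shows only the general idea of keeping a pretrained weight frozen while training a small low-rank update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real LTX layer dimensions are not given in this text.
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialised


def lora_linear(x: np.ndarray) -> np.ndarray:
    """Frozen linear layer plus a scaled low-rank update: x W^T + (alpha / rank) x A^T B^T."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / rank) * update


x = rng.standard_normal((2, d_in))
y_initial = lora_linear(x)  # equals the frozen layer's output while B is all zeros

full_params = W.size           # parameters in the frozen weight
lora_params = A.size + B.size  # parameters that would actually be trained
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only A and B would need gradient updates. In this toy configuration that is 512 trainable parameters against 4096 frozen ones.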