Generative Event Pretraining with Foundation Model Alignment

Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza

Abstract

Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
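The joint regression-contrastive objective of the alignment stage can be illustrated with a minimal sketch. The abstract does not give the exact loss forms, so the mean squared error regression term, the one-directional InfoNCE contrastive term, the `contrastive_weight`, and the `temperature` below are all assumptions made for illustration, not the authors' formulation. Features are plain Python lists, with `image_feats` standing in for the outputs of the frozen VFM.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def regression_loss(event_feats, image_feats):
    """Assumed regression term: mean squared error over paired features."""
    total, count = 0.0, 0
    for event_vec, image_vec in zip(event_feats, image_feats):
        for a, b in zip(event_vec, image_vec):
            total += (a - b) ** 2
            count += 1
    return total / count


def contrastive_loss(event_feats, image_feats, temperature=0.07):
    """Assumed contrastive term: InfoNCE from each event feature to all image
    features in the batch, with the synchronized image as the positive."""
    events = [l2_normalize(v) for v in event_feats]
    images = [l2_normalize(v) for v in image_feats]
    loss = 0.0
    for k, event_vec in enumerate(events):
        logits = [
            sum(a * b for a, b in zip(event_vec, image_vec)) / temperature
            for image_vec in images
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[k]
    return loss / len(events)


def alignment_loss(event_feats, image_feats, contrastive_weight=1.0, temperature=0.07):
    """Hypothetical joint objective: regression plus weighted contrastive term."""
    return regression_loss(event_feats, image_feats) + contrastive_weight * contrastive_loss(
        event_feats, image_feats, temperature
    )
```

In this sketch only the event features would receive gradients, since the image features come from a frozen encoder. The regression term pulls each event feature toward its paired image feature, while the contrastive term additionally pushes it away from the other images in the batch.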

Paper Structure

This paper contains 36 sections, 28 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Overall comparison across ten metrics on various datasets. With the same backbone model, our method demonstrates superior and consistent performance across all tasks.
  • Figure 2: The overall two-stage framework. (a) Alignment stage: Event frames and synchronized images are encoded by an event encoder and a frozen VFM encoder. The event encoder is optimized with regression, contrastive, and preservation terms to match the semantic structure of image features. (b) Autoregressive pretraining stage: Aligned event and image embeddings are interleaved into a single sequence and processed by a causal transformer, which predicts future slices from partial windows, learning long-range temporal structure and cross-modal consistency. Arrows indicate causal dependencies; only a subset is shown for visual clarity.
  • Figure 3: Visualization of a 16-frame context (blue) and a 32-frame autoregressively generated future (red) on validation interleaved event and image streams. Green boxes highlight consistent motion. Insets enlarge the first context and last generated frames for clarity.
  • Figure 4: Qualitative results on the DSEC validation set. The RGB images are shown only for visual reference and are not used by the model.
  • Figure 5: Attention map response on event camera data.
  • ...and 3 more figures
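
The autoregressive pretraining stage described in the Figure 2 caption, where aligned event and image embeddings are interleaved into one sequence and processed causally, can be sketched as follows. This is a rough illustration under stated assumptions: one token per frame, an event-then-image ordering within each time step, and a simple prefix-to-next-token prediction target. None of these details are specified in the captions above, and the function names are hypothetical.

```python
def interleave(event_tokens, image_tokens):
    """Alternate event and image tokens time step by time step.

    Assumption: both streams are synchronized and have equal length.
    Each entry is (modality, time_index, token).
    """
    if len(event_tokens) != len(image_tokens):
        raise ValueError("event and image streams must have the same length")
    sequence = []
    for t, (event_tok, image_tok) in enumerate(zip(event_tokens, image_tokens)):
        sequence.append(("event", t, event_tok))
        sequence.append(("image", t, image_tok))
    return sequence


def causal_mask(length):
    """mask[row][col] is True when position `row` may attend to position `col`,
    i.e. only to itself and to earlier positions."""
    return [[col <= row for col in range(length)] for row in range(length)]


def next_token_pairs(sequence):
    """Build (context, target) pairs: each prefix is used to predict the
    element that follows it."""
    return [(sequence[:k], sequence[k]) for k in range(1, len(sequence))]
```

For example, `interleave(["e0", "e1"], ["i0", "i1"])` yields a four-element sequence whose modalities alternate between event and image, and `causal_mask(4)` gives the lower-triangular visibility pattern that a causal transformer would apply to it.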