Temporal-Guided Visual Foundation Models for Event-Based Vision

Ruihao Xia, Junhong Cai, Luziwei Leng, Liuyi Wang, Chengju Liu, Ran Cheng, Yang Tang, Pan Zhou

TL;DR

The paper tackles the challenge of applying image-trained Visual Foundation Models to asynchronous event-based vision by introducing Temporal-Guided VFM (TGVFM). It presents a plug-and-play Temporal Context Fusion Block (TCFB) that enables long-range temporal reasoning through Long-Range Temporal Attention, Dual Spatiotemporal Attention, and Deep Feature Guidance Mechanism, while preserving pretrained VFM knowledge via zero-initialized residuals and careful training. A two-phase pipeline retrains E2VID on real-world data and fuses temporal context within transformer VFMs, achieving SoTA across semantic segmentation, depth estimation, and object detection on the DSEC benchmark, with significant gains over prior methods. The work demonstrates the viability of cross-modality transfer from image-based VFMs to event-based vision and highlights practical improvements such as memory-bank-based temporal modeling, semantic-guided fusion, and distillation-based supervision, offering a scalable path for temporal reasoning in event streams.

Abstract

Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
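The TL;DR notes that pretrained VFM knowledge is preserved via zero-initialized residuals. The sketch below illustrates that general idea only and is not the authors' implementation: a single-head cross-attention from the current frame's tokens to tokens of earlier frames, whose output projection starts at zero so the block initially passes its input through unchanged. The class name `ZeroInitTemporalAttention`, the single-head layout, and all shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ZeroInitTemporalAttention:
    """Hypothetical single-head cross-attention with a zero-initialized residual."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumption: the output projection starts at zero, so the residual
        # branch contributes nothing until training updates it.
        self.w_o = np.zeros((dim, dim))

    def __call__(self, f_t, past):
        # f_t:  (N, dim) tokens of the current frame
        # past: (M, dim) tokens gathered from earlier frames
        q = f_t @ self.w_q
        k = past @ self.w_k
        v = past @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        return f_t + (attn @ v) @ self.w_o


rng = np.random.default_rng(0)
block = ZeroInitTemporalAttention(dim=8, rng=rng)
f_t = rng.normal(size=(4, 8))     # 4 tokens in the current frame
past = rng.normal(size=(12, 8))   # 12 tokens from earlier frames
out = block(f_t, past)
print(np.allclose(out, f_t))      # True: the block is an identity at initialization
```

Because the residual branch is exactly zero at the start, inserting such a block between pretrained layers leaves the pretrained model's outputs unchanged before training, which is the usual motivation for this kind of initialization.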

Paper Structure

This paper contains 20 sections, 14 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (1) TGVFM: Our proposed Temporal Context Fusion Block (TCFB) is integrated in a unified manner into VFMs specifically designed for different tasks, extending the spatial reasoning capability of traditional VFMs to spatio-temporal reasoning. (2) Experiments: Compared to the SoTA methods ECDDP, CMDA, PCDepth, and LEOD on the day and night sequences of the DSEC datasets [DSEC, ESS, CMDA, DSEC-Det], our TGVFM demonstrates significant improvements in all tasks.
  • Figure 2: Our TGVFM framework integrates several of the proposed TCFBs between ViT blocks to extract both spatial and temporal features across multiple frames. In each TCFB, the input feature $f_t$ is processed by different attention operations to interact with the previous features $f_{t-1:t-k}$ and $\mathbf{F}_{t-1:t-k}$ stored in the memory bank for temporal reasoning. For clarity, we omit the residual connections in the attention and feed-forward network.
  • Figure 3: Left: Qualitative results of grayscale reconstruction for daytime and nighttime scenes by the previous E2VID and our retrained E2VID-B4. Right: SSIM comparison between day and night sequences for different E2VID variants.
  • Figure 4: Comparison results of semantic segmentation with ECDDP (Daytime) and CMDA (Nighttime).
  • Figure 5: Quantitative comparison results of monocular depth estimation with SoTA PCDepth.
  • ...and 4 more figures
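
The Figure 2 caption describes a memory bank holding the features of the previous $k$ frames, $f_{t-1:t-k}$. A minimal sketch of such a buffer follows, assuming first-in-first-out eviction and a read-before-write order per frame; the class name `FeatureMemoryBank` and its methods are hypothetical and are not taken from the paper's code.

```python
from collections import deque

import numpy as np


class FeatureMemoryBank:
    """Hypothetical FIFO buffer that keeps the features of the last k frames."""

    def __init__(self, k):
        # deque(maxlen=k) drops the oldest entry automatically once it is full.
        self.buf = deque(maxlen=k)

    def read(self):
        # Return stored features newest first (t-1, t-2, ..., t-k),
        # or None when nothing has been written yet.
        if not self.buf:
            return None
        return np.stack(list(self.buf)[::-1])

    def write(self, f_t):
        self.buf.append(f_t)


bank = FeatureMemoryBank(k=2)
for t in range(4):
    f_t = np.full((3, 8), float(t))   # stand-in features: 3 tokens, 8 channels
    past = bank.read()                # features of earlier frames, if any
    # ... temporal fusion of f_t with `past` would happen here ...
    bank.write(f_t)

print(bank.read().shape)              # (2, 3, 8): only the last k = 2 frames are kept
```

Reading before writing keeps the current frame out of its own temporal context, so the fusion step only ever sees earlier frames.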