Temporal-Guided Visual Foundation Models for Event-Based Vision
Ruihao Xia, Junhong Cai, Luziwei Leng, Liuyi Wang, Chengju Liu, Ran Cheng, Yang Tang, Pan Zhou
TL;DR
The paper tackles the challenge of applying image-trained Visual Foundation Models (VFMs) to asynchronous event-based vision by introducing Temporal-Guided VFM (TGVFM). It presents a plug-and-play Temporal Context Fusion Block (TCFB) that enables long-range temporal reasoning through Long-Range Temporal Attention, Dual Spatiotemporal Attention, and a Deep Feature Guidance Mechanism, while preserving pretrained VFM knowledge via zero-initialized residuals and careful training. A two-phase pipeline retrains E2VID on real-world data and fuses temporal context within transformer VFMs, achieving SoTA results in semantic segmentation, depth estimation, and object detection on the DSEC benchmark, with significant gains over prior methods. The work demonstrates the viability of cross-modality transfer from image-based VFMs to event-based vision and highlights practical improvements such as memory-bank-based temporal modeling, semantic-guided fusion, and distillation-based supervision, offering a scalable path for temporal reasoning in event streams.
Abstract
Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To bridge this gap, we propose Temporal-Guided VFM (TGVFM), a novel framework that seamlessly integrates VFMs with our temporal context fusion block. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) a Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
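The summary above mentions that the temporal block preserves pretrained VFM knowledge through zero-initialized residuals. The following is a minimal, dependency-free sketch of that general idea, not the authors' implementation: a temporal attention step reads from a memory of past-frame features, and its output passes through a projection that starts at zero, so the block behaves as an identity until training updates it. All names here (`temporal_attention`, `ZeroInitTemporalResidual`, `out_proj`) are hypothetical, and the single-token, single-head setup is an assumption made for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, memory):
    """Scaled dot-product attention from the current feature to past features.

    query:  feature vector of the current frame (list of floats)
    memory: list of feature vectors from earlier frames (same length as query)
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, past)) for past in memory]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the past features, one output value per channel.
    return [sum(w * past[i] for w, past in zip(weights, memory)) for i in range(dim)]


class ZeroInitTemporalResidual:
    """Hypothetical residual wrapper: output = current + out_proj @ context."""

    def __init__(self, dim):
        # Assumption: the output projection starts as all zeros, so the block
        # leaves the pretrained features untouched at initialization.
        self.out_proj = [[0.0] * dim for _ in range(dim)]

    def __call__(self, current, memory):
        context = temporal_attention(current, memory)
        update = [sum(w * c for w, c in zip(row, context)) for row in self.out_proj]
        return [x + u for x, u in zip(current, update)]
```

With the projection at zero, calling the block on any current feature and memory returns the current feature unchanged; once `out_proj` becomes non-zero, temporal context from the memory is added to the feature as a residual.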
