Event Transformer+. A multi-purpose solution for efficient event data processing

Alberto Sabater, Luis Montesano, Ana C. Murillo

TL;DR

Event Transformer+ (EvT+) tackles the challenge of efficiently leveraging sparse event-camera data for recognition and depth estimation by introducing a refined patch-based representation and a memory-augmented transformer backbone capable of fusing multi-modal inputs. It supports two task heads for event-stream classification and dense per-pixel estimation, achieving state-of-the-art or competitive results on real-event benchmarks while maintaining low latency on both GPU and CPU. The approach demonstrates strong performance across modalities (events plus grayscale images) and tasks, highlighting the practical impact of Transformer-based architectures on sparse sensor data. Overall, EvT+ offers a scalable, efficient framework that can extend to other sparse sensing modalities such as LiDAR, enabling fast, accurate perception in resource-constrained environments.

Abstract

Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low power consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current top-performing methods often ignore specific event-data properties, leading to generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, which improves on our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still exploiting event-data sparsity to increase efficiency. Additionally, we show how our system can work with different data modalities and propose specific output heads for event-stream classification (i.e., action recognition) and per-pixel predictions (dense depth estimation). Evaluation results show better performance than the state of the art while requiring minimal computational resources, both on GPU and CPU.

Paper Structure (17 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Framework overview. Areas (activated patches (2)) from the input data (event frames and images (1)) with sufficient information are extracted and processed by the EvT$^+$ backbone (3) to update a set of latent memory vectors. Different output heads (4) are used for: a) event-stream classification by processing the latent memory, and b) multi-modal dense estimation by updating and further processing the input information with the latent memory vectors.
  • Figure 2: Patch-based event data representation. (a) For each pixel, we retain the last $K$ events with sufficient sparsity in time. (b) Frame representations are built with the time-stamps of the queued events. (c) Frames are split into patches, keeping only the activated patches, i.e., with enough event information generated during the time-window span.
  • Figure 3: Event Transformer$^+$ overview. The input is a set of time-window representations (e.g., event frames or images) that are processed sequentially. Each time-window representation generates a set of patch tokens $T$ that is processed (1., 2.) to update a set of latent memory vectors (3.), which encodes the information seen so far. For event-stream classification (4.a.), the latent vectors are directly processed with a simple classifier. For dense estimation (4.b.), we convert the input sparse representation to a dense one by adding dummy tokens and positional information, and we process it along with the information encoded in the latent memory vectors, generating the final dense prediction.
  • Figure 4: Avg. number of activated patches (vertical axis) generated at each time window on different datasets with different patch sizes (horizontal axis). Stars: the selected hyperparameter value.
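
The patch-based representation described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`build_event_frame`, `activated_patches`), the `(t, x, y)` event layout, the timestamp normalization, and the `min_events` activation threshold are all assumptions made for illustration. Polarity handling and the "sufficient sparsity in time" filter mentioned in the caption are omitted for brevity.

```python
import numpy as np


def build_event_frame(events, height, width, k, t_start, t_end):
    """Keep the timestamps of the last `k` events per pixel (hypothetical helper).

    events: float array of shape (N, 3) with rows (t, x, y), sorted by time.
    Returns a (height, width, k) frame of timestamps normalized to the time
    window, plus the number of events stored at each pixel.
    """
    frame = np.zeros((height, width, k), dtype=np.float32)
    counts = np.zeros((height, width), dtype=np.int64)
    span = t_end - t_start
    # Walk from newest to oldest so that slot 0 holds the most recent event.
    for t, x, y in events[::-1]:
        if not (t_start <= t <= t_end):
            continue
        x, y = int(x), int(y)
        c = counts[y, x]
        if c < k:  # the per-pixel queue is capped at k events
            frame[y, x, c] = (t - t_start) / span
            counts[y, x] = c + 1
    return frame, counts


def activated_patches(frame, counts, patch_size, min_events):
    """Split the frame into patches and keep only the activated ones.

    Assumption: a patch counts as activated when it stores at least
    `min_events` events; the paper's exact criterion may differ.
    Returns flattened patch tokens and their (row, col) grid positions.
    """
    h, w, k = frame.shape
    ph, pw = h // patch_size, w // patch_size
    # Crop so that both dimensions are multiples of the patch size.
    frame = frame[: ph * patch_size, : pw * patch_size]
    counts = counts[: ph * patch_size, : pw * patch_size]
    per_patch = counts.reshape(ph, patch_size, pw, patch_size).sum(axis=(1, 3))
    rows, cols = np.nonzero(per_patch >= min_events)
    patches = frame.reshape(ph, patch_size, pw, patch_size, k)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (ph, pw, p, p, k)
    tokens = patches[rows, cols].reshape(len(rows), patch_size * patch_size * k)
    positions = np.stack([rows, cols], axis=1)
    return tokens, positions
```

Because only activated patches become tokens, the number of tokens passed to the backbone scales with scene activity rather than with sensor resolution, which is the source of the efficiency gain the captions refer to.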