ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data

Carmen Martin-Turrero, Maxence Bouvier, Manuel Breitenstein, Pietro Zanuttigh, Vincent Parret

TL;DR

This work tackles the challenge of extracting dense, real-time insights from sparse, asynchronous event-based data. It introduces a hybrid architecture that first builds a PointNet-based LERT embedding to create tokens from local event patches, then extends this with an asynchronous ALERT module that updates tokens on the fly using time encoding and a memory-decay mechanism. The end-to-end trainable (A)LERT-Transformer enables both high-accuracy synchronous inference and ultra-low-latency asynchronous inference, demonstrated on gesture recognition and binary classification with favorable latency-accuracy trade-offs. The approach preserves event-data sparsity while exploiting standard ML tooling, making it well-suited for edge applications requiring flexible sampling rates and energy-efficient real-time processing.

Abstract

We seek to enable classic processing of continuous ultra-sparse spatiotemporal data generated by event-based sensors with dense machine learning models. We propose a novel hybrid pipeline composed of asynchronous sensing and synchronous processing that combines several ideas: (1) an embedding based on PointNet models, the ALERT module, that can continuously integrate new events and dismiss old ones thanks to a leakage mechanism; (2) a flexible readout of the embedded data that can feed any downstream model with always up-to-date features at any sampling rate; (3) a patch-based approach inspired by Vision Transformer that exploits the input sparsity to optimize the efficiency of the method. These embeddings are then processed by a transformer model trained for object and gesture recognition. Using this approach, we achieve state-of-the-art performance with lower latency than competitors. We also demonstrate that our asynchronous model can operate at any desired sampling rate.
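To make ideas (1) and (2) concrete, here is a minimal sketch of a leaky per-patch feature that is updated event by event and can be read at any time. It is an illustration only: the exponential decay, the running element-wise maximum (borrowed from PointNet-style max pooling), and the names `LeakyPatchFeature`, `tau`, `update`, and `read` are assumptions, not the paper's formulation.

```python
import math
from typing import List


class LeakyPatchFeature:
    """Hypothetical leaky feature vector for one spatial patch."""

    def __init__(self, dim: int, tau: float) -> None:
        self.values: List[float] = [0.0] * dim
        self.last_t = 0.0   # time of the most recent update
        self.tau = tau      # assumed decay time constant

    def _decayed(self, t: float) -> List[float]:
        # Exponential leak since the last update (assumes t >= last_t).
        factor = math.exp(-(t - self.last_t) / self.tau)
        return [v * factor for v in self.values]

    def update(self, t: float, event_feature: List[float]) -> None:
        # Asynchronous path: leak old content, then merge the new event
        # with an element-wise max.
        decayed = self._decayed(t)
        self.values = [max(v, f) for v, f in zip(decayed, event_feature)]
        self.last_t = t

    def read(self, t: float) -> List[float]:
        # Synchronous path: sample the current state without modifying it.
        return self._decayed(t)


feature = LeakyPatchFeature(dim=2, tau=1.0)
feature.update(0.0, [1.0, 0.5])
snapshot = feature.read(1.0)        # old events have partly leaked away
feature.update(1.0, [0.2, 0.9])     # new event dominates the second channel
```

Because `read` never mutates the state, a downstream model could call it at whatever rate it needs while `update` keeps running on incoming events.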

Paper Structure (38 sections, 9 equations, 14 figures, 4 tables, 1 algorithm)

This paper contains 38 sections, 9 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Schematic representation of the system integrating our proposed asynchronous embedding module. The asynchronous part (left) processes each event as it arrives, in an event-driven manner, and so updates the $Features$ tensor continuously. The synchronous part (right) samples the $Features$ tensor on demand, providing a seamless interface between asynchronous and synchronous processing.
  • Figure 2: Overview of the (A)LERT-Transformer model. The event stream is spatially divided into local event clouds. The red crosses mark tokens that are non-active based on their number of events. The (A)LERT module converts the local event clouds into individual high-dimensional features, which are fed to a Transformer and classifier head.
  • Figure 3: Schematic representation of the temporal sampling. In (a) Constant Count Input Mode (CCIM), the events are sampled from the continuous input event stream by splitting them into bins that contain the same number of events. In (b) Constant Time Input Mode (CTIM), each bin represents the same input duration. In both modes, every sample is then split into several subsets, based on the spatial coordinates of the events, as depicted in Figure 2. Note: blue and red points represent events with different polarities.
  • Figure 4: (A)LERT module: spatially local event cloud to token.
  • Figure 5: Overview of the modes of the (A)LERT feature generator (FG). The Time Encoded LERT FG (b) is used for training ALERT (c).
  • ...and 9 more figures
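
The two temporal sampling modes described for Figure 3 and the spatial split described for Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the event layout `(t, x, y, polarity)` with integer microsecond timestamps, the function names, the square `patch_size`, and the `min_events` activity threshold are all hypothetical choices, not taken from the paper.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, int, int, int]  # assumed layout: (t_us, x, y, polarity)


def split_constant_count(events: List[Event], n_per_bin: int) -> List[List[Event]]:
    """CCIM-style sampling: every bin holds the same number of events."""
    # Trailing events that cannot fill a whole bin are dropped.
    usable = len(events) - len(events) % n_per_bin
    return [events[i:i + n_per_bin] for i in range(0, usable, n_per_bin)]


def split_constant_time(events: List[Event], bin_us: int) -> List[List[Event]]:
    """CTIM-style sampling: every bin covers the same duration."""
    if not events:
        return []
    t0 = events[0][0]  # events are assumed sorted by timestamp
    n_bins = (events[-1][0] - t0) // bin_us + 1
    bins: List[List[Event]] = [[] for _ in range(n_bins)]
    for ev in events:
        bins[(ev[0] - t0) // bin_us].append(ev)
    return bins


def split_into_patches(sample: List[Event], patch_size: int,
                       min_events: int = 1) -> Dict[Tuple[int, int], List[Event]]:
    """Group one sample into local event clouds keyed by patch coordinates."""
    patches: Dict[Tuple[int, int], List[Event]] = {}
    for ev in sample:
        key = (ev[1] // patch_size, ev[2] // patch_size)
        patches.setdefault(key, []).append(ev)
    # Patches with too few events are treated as non-active and skipped.
    return {k: v for k, v in patches.items() if len(v) >= min_events}


events = [(0, 0, 0, 1), (100, 5, 5, 0), (200, 17, 3, 1),
          (1000, 2, 18, 1), (1100, 16, 16, 0), (2500, 1, 1, 1)]
count_bins = split_constant_count(events, n_per_bin=2)
time_bins = split_constant_time(events, bin_us=1000)
active = split_into_patches(events, patch_size=16, min_events=2)
```

In this toy stream, constant-count bins always have two events but span different durations, while constant-time bins span 1000 µs each but hold different numbers of events.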