Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras

Christoffer Koo Øhrstrøm, Ronja Güldenring, Lazaros Nalpantidis

TL;DR

The paper tackles the challenge of representing asynchronous, sparse event camera data for real-time processing. It introduces Spiking Patches (SP), a patch-based spiking tokenizer that preserves asynchrony and spatial sparsity, producing tokens that can be consumed by GNNs, PCNs, and Vision Transformers. Across gesture recognition and object detection, SP matches or improves accuracy while delivering major speedups (up to $3.4\times$ over voxel-based tokens and $10.4\times$ over frames), and ablations show that threshold and refractory period choices effectively control token counts without large accuracy penalties. This work demonstrates that tokenization is a viable, scalable direction for event-based vision, enabling real-time performance on practical hardware.

Abstract

We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.

Paper Structure

This paper contains 18 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We present a tokenizer particularly developed for event cameras: Spiking Patches (SP). Events (left) are tokenized using SP (middle) and the resulting tokens (right) are processed by a model. We show that SP is compatible with Graph Neural Networks (GNNs), Point Cloud Networks (PCNs), and Transformers. SP produces tokens that preserve two unique characteristics of event cameras: Asynchrony and spatial sparsity.
  • Figure 2: Overview of Spiking Patches. Events are arriving within a single patch. Each event causes a constant increase in the patch potential. The first spike occurs at event $e_4$ when the potential exceeds the threshold $\sigma$. The resulting token consists of $e_1$ through $e_4$. The events $e_5$ through $e_7$ are ignored as they occur within the refractory period $T$. The second spike occurs at $e_{11}$ and the resulting token consists of $e_{8}$ through $e_{11}$.
  • Figure 3: Event accumulation for events, Spiking Patches (3 different thresholds), and frames. Frames are synchronous with a 50 ms response time. Spiking Patches is asynchronous and follows the slope of the event curve, albeit with a 5 ms to 20 ms delay. Spiking Patches can react faster than synchronous methods.
  • Figure 4: Spatial sparsity differences between events and frames (top), voxels (middle), and Spiking Patches (bottom). Negative values are more dense and positive values are more sparse than events. Spiking Patches has comparable sparsity to events for $\sigma \in [250, 500]$ or $T = 100$ ms.
  • Figure 5: Ablation of spike threshold $\sigma$. mAP remains high for $\sigma \leq 250$ and decreases afterwards.
  • ...and 5 more figures
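
The spiking mechanism described in the Figure 2 caption can be illustrated with a minimal sketch for a single patch. This is not the authors' implementation: the function name `tokenize_patch`, the unit potential increment, resetting the potential to zero after a spike, and measuring the refractory period from the spike timestamp are all assumptions made for illustration. Events are reduced to their timestamps, and a token is represented as the spike time plus the indices of the events it contains.

```python
from dataclasses import dataclass, field


@dataclass
class PatchState:
    """Per-patch state (hypothetical structure, not from the paper)."""
    potential: float = 0.0
    last_spike_t: float = float("-inf")
    buffer: list = field(default_factory=list)


def tokenize_patch(timestamps, sigma, refractory, increment=1.0):
    """Turn the event timestamps of ONE patch into tokens.

    Each event adds a constant `increment` to the patch potential.
    When the potential exceeds `sigma`, the patch spikes and emits a token
    containing the buffered events. Events arriving less than `refractory`
    time units after a spike are ignored (assumed boundary handling).
    """
    state = PatchState()
    tokens = []
    for index, t in enumerate(timestamps):
        # Ignore events that fall inside the refractory period.
        if t - state.last_spike_t < refractory:
            continue
        state.potential += increment
        state.buffer.append(index)
        if state.potential > sigma:
            # Spike: emit a token, then reset potential and buffer (assumed).
            tokens.append((t, state.buffer))
            state = PatchState(last_spike_t=t)
    return tokens


# Eleven events e_1..e_11 (indices 0..10), timestamps in ms, chosen to
# mirror the scenario in the Figure 2 caption.
example_times = [0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12]
example_tokens = tokenize_patch(example_times, sigma=3, refractory=5)
print(example_tokens)  # [(3, [0, 1, 2, 3]), (12, [7, 8, 9, 10])]
```

With these assumed values, the first token spikes at the fourth event and contains $e_1$ through $e_4$, the next three events are dropped by the refractory period, and the second token contains $e_8$ through $e_{11}$, matching the caption. Because each patch spikes independently, as soon as its own potential crosses the threshold, tokens are emitted asynchronously and only for patches that receive events.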