Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
Christoffer Koo Øhrstrøm, Ronja Güldenring, Lazaros Nalpantidis
TL;DR
The paper addresses how to represent asynchronous, spatially sparse event camera data without discarding those properties. It introduces Spiking Patches (SP), a patch-based spiking tokenizer that preserves asynchrony and spatial sparsity and produces tokens that GNNs, PCNs, and Transformers can consume. On gesture recognition and object detection, SP matches or improves accuracy while cutting inference time (up to $3.4\times$ faster than voxel-based tokens and up to $10.4\times$ faster than frames). Ablations show that the spiking threshold and the refractory period control the number of tokens without large accuracy penalties. The results position tokenization as a viable direction for event-based vision that keeps the native properties of event cameras.
Abstract
We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and reduce spatial sparsity. Spiking Patches provides the means to preserve the unique properties of event cameras, and our experiments show that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, a PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy, and even surpassing it in some cases, with absolute improvements of up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
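To make the idea of a patch-based spiking tokenizer concrete, the following is a minimal sketch and not the authors' implementation. It assumes each patch behaves like a simple integrate-and-fire unit: it counts incoming events, emits a token when the count reaches a threshold, resets, and then ignores events for a refractory period. The function name `spiking_patch_tokens`, the parameters `patch_size`, `threshold`, and `refractory_us`, the reset-to-zero rule, and the `(x, y, t)` event layout are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, int, int]   # (x, y, timestamp in microseconds); polarity omitted
Token = Tuple[int, int, int]   # (patch column, patch row, spike timestamp)


def spiking_patch_tokens(
    events: Iterable[Event],
    patch_size: int = 4,
    threshold: int = 3,
    refractory_us: int = 0,
) -> List[Token]:
    """Emit one token each time a patch accumulates `threshold` events.

    Assumed behaviour (illustrative only): events that arrive within
    `refractory_us` of a patch's last spike are dropped, and the patch
    counter resets to zero after every spike.
    """
    potential: Dict[Tuple[int, int], int] = {}   # per-patch event counter
    last_spike: Dict[Tuple[int, int], int] = {}  # per-patch last spike time
    tokens: List[Token] = []

    for x, y, t in events:  # events are processed one at a time, in order
        key = (x // patch_size, y // patch_size)

        # Skip events that fall inside the patch's refractory window.
        if key in last_spike and t - last_spike[key] < refractory_us:
            continue

        potential[key] = potential.get(key, 0) + 1
        if potential[key] >= threshold:
            tokens.append((key[0], key[1], t))  # token carries the spike time
            potential[key] = 0
            last_spike[key] = t

    return tokens


events = [
    (0, 0, 0), (1, 1, 100), (2, 3, 200),   # three events in patch (0, 0)
    (9, 9, 300),                           # lone event in patch (2, 2)
    (0, 1, 400), (1, 0, 500), (3, 3, 600), # three more in patch (0, 0)
]
print(spiking_patch_tokens(events))
print(spiking_patch_tokens(events, refractory_us=250))
```

In this toy stream, only patches that receive enough events produce tokens, so inactive regions stay empty, and each token is emitted at the time of the triggering event rather than on a fixed clock. Raising `threshold` or `refractory_us` lowers the token count, which is the kind of control the ablations in the TL;DR refer to.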
