Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention

Haiqing Hao; Zhipeng Sui; Rong Zou; Zijia Dai; Nikola Zubić; Davide Scaramuzza; Wenhui Wang

TL;DR

Spatially-Sparse Linear Attention (SSLA) is proposed, which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism in event-based object detection.

Abstract

Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event by event, but they still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports both parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off. This is especially problematic for object detection, where fine-grained representations, and thus fine-grained states, are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline. These results demonstrate the potential of linear attention for low-latency event-based vision.
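The claim that linear attention supports both parallel training and recurrent inference can be illustrated with a minimal sketch. It assumes the simplest unnormalized form with an identity feature map, which may differ from the formulation used in the paper; the function names `recurrent_linear_attention` and `parallel_linear_attention` are hypothetical and chosen only for this example. The recurrent form updates one global state per event, while the causally masked form computes the same outputs from the whole sequence at once.

```python
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recurrent_linear_attention(qs: List[Vec], ks: List[Vec], vs: List[Vec]) -> List[Vec]:
    """Event-by-event form: S_t = S_{t-1} + k_t v_t^T, o_t = q_t^T S_t.

    Assumption: unnormalized attention with an identity feature map.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # the single global state
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Every event touches the whole d_k x d_v state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def parallel_linear_attention(qs: List[Vec], ks: List[Vec], vs: List[Vec]) -> List[Vec]:
    """Causally masked form: o_t = sum over s <= t of (q_t . k_s) v_s.

    Each output depends only on the inputs, so all t can be computed
    independently of each other (written as plain loops here for clarity).
    """
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for s in range(t + 1):
            weight = dot(q, ks[s])
            for j in range(d_v):
                out[j] += weight * vs[s][j]
        outputs.append(out)
    return outputs
```

Both functions return the same outputs for the same inputs. The recurrent form is what an asynchronous model would run at inference time, and its cost per event grows with the size of the state, which is the trade-off the abstract refers to.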

Paper Structure (31 sections, 8 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Asynchronous Event-based Neural Networks
State-level Sparsity in Linear Attention
Method
Problem Formulation.
Preliminaries: Linear Attention
Spatially-Sparse Linear Attention
Sparse State Activation with Mixture-of-Spaces
Position-Aware Projection
Parallelizable Training
Scatter.
Compute.
Gather.
Neural Network Details
...and 16 more sections

Figures (4)

Figure 1: Our method processes (a) an asynchronous event sequence with (b) a sparsely activated linear attention neural network for (c) low-latency event-based object detection. (d) On the Gen1 dataset, our SSLA-Det models achieve state-of-the-art asynchronous mAP and lower FLOPS than previous asynchronous baselines. $^{*}$ refers to AP$_{50}$.
Figure 2: Overview of the Spatially-Sparse Linear Attention (SSLA) module. (1) and (2) are examples of $2\times 2$ sliding-window patches in the spatial domain. Each event is (a) scattered into the multiple patches that spatially cover it (e.g., $e_i$ to patches (1) and (2); for brevity, the other patches that cover $e_i$ are not shown). Each patch receives the subsequence of events that it covers and (b) computes interim outputs independently with linear attention (f); the interim outputs of each event are then (c) gathered into the final output. In each patch, the embeddings are projected based on their relative position within the patch (e).
Figure 3: Overview of the SSLA-Det model. Left: Fully asynchronous event-based detector. The input data is processed by a 4-stage asynchronous backbone. Each stage contains 2 SSLA layers followed by sparse pooling and temporal dropout. Right: SSLA layer. Input event embeddings are processed by the SSLA module, which comprises two position-aware projections, scatter, parallel patch-wise linear attention, and gather. A residual connection and layer normalization are used for training stability.
Figure 4: Visualization of the detection results on the Gen1 dataset. Green boxes denote ground truth and orange boxes denote predictions with confidence scores. Predicted boxes with confidence scores below 0.5 are removed. (a)-(d): Cars. (e)-(h): Pedestrians. (i)-(l): Failure cases.
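
The scatter, compute, and gather steps described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: patches are stride-1 sliding windows indexed by their top-left corner, each patch runs unnormalized linear attention with query, key, and value all equal to the event embedding, the position-aware projection is omitted, and interim outputs are gathered by summation. None of these details are confirmed by the text above, and every name in the block is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vec = List[float]
Event = Tuple[int, int, Vec]  # (x, y, embedding); list order is temporal order


def covering_patches(x: int, y: int, size: int = 2) -> List[Tuple[int, int]]:
    """Top-left corners of the size x size, stride-1 windows containing (x, y).

    Assumption: windows may not start at negative coordinates; no upper
    sensor bound is enforced in this sketch.
    """
    return [(px, py)
            for px in range(max(0, x - size + 1), x + 1)
            for py in range(max(0, y - size + 1), y + 1)]


def patch_linear_attention(embs: List[Vec]) -> List[Vec]:
    """Unnormalized causal linear attention with q = k = v = embedding (assumed)."""
    d = len(embs[0])
    state = [[0.0] * d for _ in range(d)]  # one state per patch
    outs = []
    for e in embs:
        for i in range(d):
            for j in range(d):
                state[i][j] += e[i] * e[j]
        outs.append([sum(e[i] * state[i][j] for i in range(d)) for j in range(d)])
    return outs


def scatter_compute_gather(events: List[Event], size: int = 2) -> List[Vec]:
    # Scatter: send each event index to every patch that covers it,
    # keeping temporal order inside each patch subsequence.
    patches: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x, y, _) in enumerate(events):
        for corner in covering_patches(x, y, size):
            patches[corner].append(idx)

    d = len(events[0][2])
    gathered = [[0.0] * d for _ in events]
    for idxs in patches.values():
        # Compute: patches are independent, so they could run in parallel.
        interim = patch_linear_attention([events[i][2] for i in idxs])
        # Gather: sum each event's interim outputs (aggregation is assumed).
        for i, out in zip(idxs, interim):
            for j in range(d):
                gathered[i][j] += out[j]
    return gathered
```

In this sketch an event only reads and writes the states of the patches that cover it, so events at distant pixels never share a state. That is the sense in which the state activation is sparse, while the per-patch subsequences remain independent and can be processed in parallel during training.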
