State Space Models for Event Cameras

Nikola Zubić; Mathias Gehrig; Davide Scaramuzza

TL;DR

This work addresses the poor generalization of event-camera networks across inference frequencies by introducing state-space models with learnable timescales into a ViT backbone (SSM-ViT). The approach uses the continuous-time S4/S5 model families for frequency-adaptive temporal aggregation, which yields faster training and robust performance when the model is deployed at frequencies different from the training frequency. Two aliasing countermeasures, frequency-selective masking and $H_{2}$ norm regularization, mitigate high-frequency artifacts and enable reliable operation at higher temporal resolutions. Empirically, the SSM-ViT models reach competitive mAP on the Gen1 and 1 Mpx datasets while reducing training time and showing only a small performance drop across frequency shifts, which makes them practical for high-speed event-based vision systems.
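The role of the timescale can be illustrated with a minimal sketch. It assumes a single diagonal mode with zero-order-hold discretization, a common choice for S4/S5-style models; the function names, the eigenvalue, and the step sizes are hypothetical and not taken from the paper. Holding the input constant across sub-steps is also a simplification of real event data.

```python
import cmath

def discretize_zoh(lam, b, delta):
    """Zero-order-hold discretization of one mode x'(t) = lam * x(t) + b * u(t)."""
    a_bar = cmath.exp(lam * delta)
    b_bar = (a_bar - 1.0) / lam * b
    return a_bar, b_bar

def run(lam, b, delta, inputs):
    """Run the discretized recurrence and return the state after every step."""
    a_bar, b_bar = discretize_zoh(lam, b, delta)
    x = 0j
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states

lam = complex(-0.5, 3.0)   # hypothetical stable eigenvalue (negative real part)
b = 1.0
delta_train = 0.05         # hypothetical step size used during training

coarse_inputs = [1.0, 0.0, 2.0, -1.0]
# Deploy at twice the frequency: every input window is split into two sub-steps.
fine_inputs = [u for u in coarse_inputs for _ in range(2)]

coarse = run(lam, b, delta_train, coarse_inputs)
rescaled = run(lam, b, delta_train / 2, fine_inputs)   # timescale adapted
naive = run(lam, b, delta_train, fine_inputs)          # timescale left unchanged

# Compare the states at the time instants shared by both sampling rates.
rescaled_gap = max(abs(c - f) for c, f in zip(coarse, rescaled[1::2]))
naive_gap = max(abs(c - f) for c, f in zip(coarse, naive[1::2]))
print(rescaled_gap, naive_gap)
```

With the halved step size the states agree at the shared time instants up to floating-point error, whereas reusing the training step size at the higher rate produces a different trajectory. This is the mismatch that fixed-step recurrent models run into when the inference frequency changes.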

Abstract

Today, state-of-the-art deep neural networks that process event-camera data first convert a temporal window of events into dense, grid-like input representations. As such, they exhibit poor generalizability when deployed at higher inference frequencies (i.e., smaller temporal windows) than the ones they were trained on. We address this challenge by introducing state-space models (SSMs) with learnable timescale parameters to event-based vision. This design adapts to varying frequencies without the need to retrain the network at different frequencies. Additionally, we investigate two strategies to counteract aliasing effects when deploying the model at higher frequencies. We comprehensively evaluate our approach against existing methods based on RNN and Transformer architectures across various benchmarks, including Gen1 and 1 Mpx event camera datasets. Our results demonstrate that SSM-based models train 33% faster and also exhibit minimal performance degradation when tested at higher frequencies than the training input. Traditional RNN and Transformer models exhibit performance drops of more than 20 mAP, with SSMs having a drop of 3.76 mAP, highlighting the effectiveness of SSMs in event-based vision tasks.
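One way to picture a masking-based aliasing countermeasure is the bandlimiting rule used for diagonal SSMs in related work: output coefficients of modes that oscillate faster than a fraction of the Nyquist limit are set to zero. The sketch below illustrates that rule under stated assumptions. Whether the paper's output masking uses exactly this criterion is an assumption, and `bandlimit_mask`, `apply_mask`, `alpha`, and all numeric values are hypothetical.

```python
import math

def bandlimit_mask(lams, delta, alpha=1.0):
    """Return 1.0 for modes to keep and 0.0 for modes to mask.

    A mode with eigenvalue lam oscillates at |Im(lam)| rad/s, which is
    delta * |Im(lam)| / (2 * pi) cycles per sample at step size delta.
    Modes at or above alpha / 2 cycles per sample are masked
    (alpha = 1.0 corresponds to the Nyquist limit).
    """
    mask = []
    for lam in lams:
        cycles_per_sample = delta * abs(lam.imag) / (2.0 * math.pi)
        mask.append(1.0 if cycles_per_sample < alpha / 2.0 else 0.0)
    return mask

def apply_mask(c, mask):
    """Zero the output coefficients of the masked modes."""
    return [c_i * m_i for c_i, m_i in zip(c, mask)]

# Hypothetical diagonal SSM with four modes of increasing oscillation frequency.
lams = [complex(-0.1, w) for w in (1.0, 20.0, 80.0, 200.0)]
c = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
delta = 0.05

mask = bandlimit_mask(lams, delta)
c_masked = apply_mask(c, mask)
print(mask)
```

In this example the two slowest modes stay below 0.5 cycles per sample and are kept, while the two fastest modes are removed from the output.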

Paper Structure (35 sections, 24 equations, 5 figures, 6 tables)

Introduction
Related Work
Object detection with Event Cameras
Continuous-time Models
Method
Preliminaries
SSM-ViT block
Low-pass bandlimiting
Output Masking
$H_{2}$ Norm
Experiments
Setup
Benchmark comparisons
Ablation study
SSMs: initializations & bandlimiting
...and 20 more sections

Figures (5)

Figure 1: Top-left: Previous works Gehrig_2023_CVPR, perot_nips2020 use RNN architectures with convolutional or attention mechanisms to train models that perform well on downstream tasks. However, the use of RNNs leads to slower training, and the learned weights only generalize well to data deployed at the same frequency as that used at training time. Bottom-left: We solve this problem by using SSMs for temporal aggregation, which enables faster training through either the S4 model gu2022efficiently or the parallel scans of S5 smith2023simplified. Because these models have a learnable timescale parameter, they can be deployed at frequencies different from those used at training time. Right: Our SSM-based models show a performance drop between training and testing frequencies of 3.76 mAP, averaged over the Gen1 Tournemire20arxiv and 1 Mpx perot_nips2020 datasets, while RVT Gehrig_2023_CVPR and GET Peng_2023_ICCV drop by 21.25 and 24.53 mAP, respectively.
Figure 2: SSM-ViT block structure
Figure 3: In the S4 layer, each dimension of the input sequence $u_{1:L} \in \mathbb{R}^{L \times H}$ is processed by a separate SSM. This process involves using a Cauchy kernel to determine the coefficients for frequency domain convolutions. The convolutions, done via FFTs, generate the output $y_{1:L} \in \mathbb{R}^{L \times H}$ for each SSM. The outputs then go through a nonlinear activation function, which includes a layer that mixes them to produce the final output of the layer.
Figure 4: For the S5 layer, a parallel scan technique is applied to a diagonal linear SSM to compute the SSM outputs. This approach bypasses the frequency-domain operations and convolution kernel computations required by S4, resulting in a model that operates purely in the time domain as a recurrence. The recurrence is parallelized using parallel scans BlellochTR90.
Figure 5: Detections on the DSEC dataset Gehrig2021ral_a with a model trained on the 1 Mpx dataset perot_nips2020. The names of the specific DSEC scenes are given in the subcaptions.
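The S4 and S5 captions above describe two ways of evaluating the same linear recurrence. The following sketch, for a single diagonal mode, compares a step-by-step recurrence, a convolution with the kernel $C \bar{A}^{l} \bar{B}$, and an inclusive scan over an associative operator. It is a simplified illustration under stated assumptions: the convolution is computed directly rather than with FFTs, the scan is a Hillis-Steele-style scan executed sequentially, and all names and numeric values are hypothetical rather than taken from the paper or the S4/S5 code bases.

```python
import cmath

def discretize_zoh(lam, b, delta):
    """Zero-order-hold discretization of one diagonal mode."""
    a_bar = cmath.exp(lam * delta)
    return a_bar, (a_bar - 1.0) / lam * b

def recurrent_outputs(a_bar, b_bar, c, u):
    """Step-by-step recurrence x_k = a_bar * x_{k-1} + b_bar * u_k."""
    x, ys = 0j, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append((c * x).real)
    return ys

def convolution_outputs(a_bar, b_bar, c, u):
    """S4-style view: y = K * u with K_l = c * a_bar**l * b_bar.

    Direct O(L^2) convolution for clarity; S4 uses FFTs for this step.
    """
    kernel = [c * a_bar ** l * b_bar for l in range(len(u))]
    return [sum(kernel[l] * u[k - l] for l in range(k + 1)).real
            for k in range(len(u))]

def combine(left, right):
    """Associative operator that composes two affine maps x -> a * x + b."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def scan_outputs(a_bar, b_bar, c, u):
    """S5-style view: inclusive scan over (a_bar, b_bar * u_k) pairs.

    Each pass doubles the span covered by every element, so a parallel
    machine needs only O(log L) passes; here the passes run sequentially.
    """
    elems = [(a_bar, b_bar * u_k) for u_k in u]
    step = 1
    while step < len(elems):
        prev = elems
        elems = [prev[i] if i < step else combine(prev[i - step], prev[i])
                 for i in range(len(prev))]
        step *= 2
    return [(c * x).real for _, x in elems]

a_bar, b_bar = discretize_zoh(complex(-0.5, 3.0), 1.0, 0.05)
c = complex(0.7, -0.2)
u = [1.0, -0.5, 0.25, 2.0, 0.0, 1.5, -1.0]

y_rec = recurrent_outputs(a_bar, b_bar, c, u)
y_conv = convolution_outputs(a_bar, b_bar, c, u)
y_scan = scan_outputs(a_bar, b_bar, c, u)
print(max(abs(r - s) for r, s in zip(y_rec, y_scan)))
```

All three routines return the same outputs up to floating-point error. The difference lies in how the work can be parallelized over the sequence length during training.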

Theorems & Definitions (1)

proof