Motion-aware Event Suppression for Event Cameras

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

TL;DR

This work introduces the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark.

Abstract

In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
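The anticipatory suppression idea described above can be illustrated with a minimal NumPy sketch. It assumes a binary IMO mask and a dense flow field are already available (in the paper these come from the network), and it uses simple nearest-neighbour forward splatting as a stand-in for the flow warping step; the function names `forward_warp_mask` and `suppress_events`, the event layout `(x, y, t, polarity)`, and the `keep` parameter are hypothetical choices for this example, not the authors' implementation.

```python
import numpy as np

def forward_warp_mask(mask, flow):
    """Move every foreground pixel of `mask` along its flow vector.

    mask: (H, W) bool array, True on IMO pixels at time t.
    flow: (H, W, 2) float array of (dx, dy) pixel displacements over the
          prediction horizon (assumed layout).
    Nearest-neighbour splatting is an assumption of this sketch.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    new_x = np.rint(xs + flow[ys, xs, 0]).astype(int)
    new_y = np.rint(ys + flow[ys, xs, 1]).astype(int)
    # Drop pixels that the flow pushes outside the sensor.
    inside = (new_x >= 0) & (new_x < w) & (new_y >= 0) & (new_y < h)
    warped = np.zeros((h, w), dtype=bool)
    warped[new_y[inside], new_x[inside]] = True
    return warped

def suppress_events(events, mask, keep="static"):
    """Filter events by whether they fall on the anticipated IMO mask.

    events: (N, 4) array with columns (x, y, t, polarity) (assumed layout).
    keep:   "static" keeps ego-motion events, "dynamic" keeps IMO events.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    on_imo = mask[y, x]
    return events[~on_imo] if keep == "static" else events[on_imo]

# Toy example: a 2x2 object moving 3 pixels to the right.
mask_t = np.zeros((6, 8), dtype=bool)
mask_t[2:4, 1:3] = True
flow = np.zeros((6, 8, 2))
flow[..., 0] = 3.0

future_mask = forward_warp_mask(mask_t, flow)

future_events = np.array([
    [4, 2, 0.01, 1],   # lands on the anticipated object
    [0, 0, 0.02, -1],  # background
    [5, 3, 0.03, 1],   # lands on the anticipated object
    [1, 2, 0.04, 1],   # where the object used to be
])
static_events = suppress_events(future_events, future_mask, keep="static")
dynamic_events = suppress_events(future_events, future_mask, keep="dynamic")
```

Because the mask is warped before the future events arrive, the filter itself reduces to one array lookup per event. Forward splatting can leave holes when the flow is non-uniform; a real system would likely need a denser warping scheme or a small dilation.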

Paper Structure (42 sections, 16 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Our method disentangles ego-motion events from events triggered by independently moving objects (IMOs). (a) We jointly learn to segment IMOs and predict dense optical flow for the next $\Delta t$. Warping the mask forward yields an anticipated future mask, enabling suppression of future events. (b) Compared to baselines such as the EVIMO baseline, our approach produces tighter masks with higher IoU and fewer false positives. (c) Event suppression can be used for downstream tasks: (top) accelerating segmentation via motion-guided token pruning, and (bottom) improving visual odometry by filtering out dynamic IMO edges.
  • Figure 2: Overview of the Anticipatory Motion Suppression pipeline. A stack of input events $\mathcal{E}_{[t-\Delta t, t)}$ is processed by a network featuring a series of recurrent blocks $E_e$ followed by our proposed attention-based time conditioning (ATC) and task-specific decoders. The network jointly predicts a binary dynamic-object mask $M_t$ and a future dense optical flow $\psi_{t \rightarrow t + \Delta t_p}$. The predicted flow is used in a flow warping module (FW) to forecast the motion of dynamic regions. By propagating the mask forward in time, the system anticipates and suppresses future events $\mathcal{E}_{[t, t+\Delta t_p)}$ corresponding to either independently moving objects or the static background, yielding a simplified event stream that contains only one type of motion.
  • Figure 3: Overview of Attention-based Time Conditioning (ATC). The target time $\Delta t_p$ is mapped via Positional Encoding (PE) to an embedding of size $C$, matching the number of spatial feature channels. These temporal features are broadcast to form a Query, while the flattened spatial features serve as the Key and Value. Cross-attention modulates the spatial features based on the temporal query, yielding the time-conditioned embedding $E'$.
  • Figure 4: Anticipatory motion suppression on EVIMO (a) and DSEC (b). Top row: Events accumulated over $\Delta t_p = 100$ ms with ground-truth IMO masks. Bottom row: Predicted separation of ego-motion (white) and IMO (green) events for the future 100 ms window based on the previous 50 ms of data. Our pipeline demonstrates: (a) robustness under extreme ego-motion with complex IMO shapes and (b) accurate motion anticipation for both distant and nearby vehicles in driving scenes.
  • Figure 5: Dynamic object masks for SViT token pruning. Left to right: SViT-0 and SViT-16 show increasing mask dilation. Larger dilation factors cover more area, including more background tokens and increasing system latency. This mask expansion introduces redundancy that significantly reduces inference frequency while capturing the same dynamic content of the scene.
  • ...and 4 more figures
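
The Attention-based Time Conditioning described in the Figure 3 caption can be sketched roughly as follows. This is a single-head NumPy illustration under several assumptions that the caption does not confirm: the positional encoding is taken to be the standard sinusoidal form, learned query/key/value projections are omitted, and the attended features are combined with the spatial features through a residual addition. The names `sinusoidal_encoding` and `time_conditioning` are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(dt, channels):
    """Encode a scalar time into `channels` values (assumed sinusoidal PE).

    `channels` must be even.
    """
    half = channels // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    return np.concatenate([np.sin(dt * freqs), np.cos(dt * freqs)])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def time_conditioning(features, dt):
    """Cross-attention between a temporal query and spatial keys/values.

    features: (C, H, W) spatial feature map.
    dt:       target prediction horizon (scalar).
    Returns a (C, H, W) time-conditioned feature map.
    """
    c, h, w = features.shape
    tokens = features.reshape(c, h * w).T             # (HW, C): Key and Value
    # Broadcast the time embedding to every spatial position to form the Query.
    query = np.broadcast_to(sinusoidal_encoding(dt, c), (h * w, c))
    weights = softmax(query @ tokens.T / np.sqrt(c))  # (HW, HW)
    attended = weights @ tokens                       # (HW, C)
    # Residual combination is an assumption of this sketch. Without learned
    # projections every query row is identical, so each position receives
    # the same attended vector.
    out = tokens + attended
    return out.T.reshape(c, h, w)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 5))
e_prime_50 = time_conditioning(features, dt=50.0)
e_prime_100 = time_conditioning(features, dt=100.0)
```

Changing `dt` changes the query and therefore the attention weights, so the same spatial features yield a different embedding for each prediction horizon.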