
E-TIDE: Fast, Structure-Preserving Motion Forecasting from Event Sequences

Biswadeep Sen, Benoit R. Cottereau, Nicolas Cuperlier, Terence Sim

Abstract

Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.

Paper Structure

This paper contains 30 sections, 20 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: E-TIDE overview. Polarity-separated event occurrence maps $\mathbf{X}=\{\mathbf{x}_t\}_{t=1}^{T_{\mathrm{in}}}$ are encoded with shared weights to obtain per-step features $\mathbf{e}_t$, which are packed along time into channels to form $\mathbf{Z}\in\mathbb{R}^{B\times D\times H'\times W'}$ with $D=T_{\mathrm{in}}C_s$ for fully-parallel spatiotemporal processing. The interaction core $\mathcal{C}(\cdot)$ is a stack of TIDE blocks: within each block we denote the input as $\mathbf{U}$ and the output as $\mathbf{U}^{+}$ (with $\mathbf{U}=\mathbf{Z}$ for the first block), and the stacked core maps $\mathbf{Z}$ to $\mathbf{Z}^{+}=\mathcal{C}(\mathbf{Z})$. Unstack denotes a reshape/view of the packed tensor (time-in-channels) for visualization; the decoder operates on $\mathbf{Z}^{+}$. The decoder upsamples to future logits $\hat{\mathbf{S}}\in\mathbb{R}^{T_{\mathrm{out}}\times 2\times H\times W}$ and probabilities $\hat{\mathbf{Y}}=\sigma(\hat{\mathbf{S}})$. We detail the internal TIDE block design in Fig. 2; a minimal code sketch of the packing step follows this list.
  • Figure 2: TIDE block (used in Fig. 1). Given packed features $\mathbf{U}$, static mixing performs large-kernel depthwise spatial aggregation followed by a pointwise projection to produce mixed features $\mathbf{F}$, while dynamic gating computes an activity mask (quantile $q$), pools only active regions, and generates a channel gate $\mathbf{g}$ via an MLP and a sigmoid. The two paths fuse through the multiplicative residual interaction $\mathbf{U}^{+}=\mathbf{g}\odot\mathbf{F}\odot(\mathbf{1}+\mathbf{U})$; a minimal code sketch of this block follows this list.
  • Figure 3: Qualitative comparison on eTraM. Example $10\!\rightarrow\!10$ forecasts showing input event frames and future predictions at $t{+}1$, $t{+}5$, and $t{+}10$. E-TIDE produces motion-aligned, temporally consistent event structure that better preserves thin boundaries compared to the event-native baseline E-Motion. Red/blue indicate ON/OFF polarities.
  • Figure 4: Accuracy--efficiency Pareto trade-offs. We plot aIoU (higher is better) against model size, inference latency, and peak VRAM (log scale where indicated). Across all three panels, E-TIDE lies on the top-left Pareto frontier, i.e., it outperforms the other models in model size, latency, and memory usage.
  • Figure 5: Downstream tasks. Qualitative results for micro-mobility, cars, and pedestrians. Predicted event frames remain structurally faithful to the ground truth, enabling reliable segmentation and tracking across diverse object categories and forecast horizons.
  • ...and 1 more figure
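
To make the time-in-channels packing described in the Figure 1 caption concrete, here is a minimal PyTorch sketch. It is a reconstruction under stated assumptions, not the paper's implementation: the encoder layers, tensor sizes, and names (`encoder`, `C_s`, `T_in`) are illustrative placeholders.

```python
# Minimal sketch of the time-in-channels packing from Fig. 1 (the encoder
# layers, sizes, and names are assumptions, not the paper's code).
import torch
import torch.nn as nn

B, T_in, H, W = 4, 10, 128, 128   # assumed batch size, input steps, resolution
C_s = 32                          # assumed per-step feature channels

# Stand-in for the shared per-step encoder (same weights applied at every t).
encoder = nn.Sequential(
    nn.Conv2d(2, C_s, kernel_size=3, stride=2, padding=1),  # 2 = ON/OFF polarities
    nn.ReLU(),
    nn.Conv2d(C_s, C_s, kernel_size=3, stride=2, padding=1),
)

X = torch.rand(B, T_in, 2, H, W)    # polarity-separated occurrence maps x_1..x_{T_in}
E = encoder(X.flatten(0, 1))        # per-step features e_t: (B*T_in, C_s, H', W')
Hp, Wp = E.shape[-2:]
Z = E.view(B, T_in * C_s, Hp, Wp)   # pack time into channels: D = T_in * C_s
```

Packing all $T_{\mathrm{in}}$ steps into the channel dimension is what lets the interaction core process the whole sequence with plain 2D operations in a single fully-parallel pass.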
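The TIDE block fusion from the Figure 2 caption can be sketched the same way. Everything not stated in the caption is an assumption here: the kernel size, the activity statistic (mean absolute feature value), the quantile level, and the gate MLP width are placeholders; only the fusion rule $\mathbf{U}^{+}=\mathbf{g}\odot\mathbf{F}\odot(\mathbf{1}+\mathbf{U})$ follows the caption directly.

```python
# Minimal sketch of the TIDE block from Fig. 2. Kernel size, activity
# statistic, quantile level, and MLP width are assumptions, not paper values.
import torch
import torch.nn as nn

class TIDEBlockSketch(nn.Module):
    def __init__(self, d: int, kernel_size: int = 7, q: float = 0.5):
        super().__init__()
        self.q = q
        # Static mixing: large-kernel depthwise aggregation + pointwise projection.
        self.depthwise = nn.Conv2d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        self.pointwise = nn.Conv2d(d, d, kernel_size=1)
        # Gate MLP over statistics pooled from active regions only.
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, U: torch.Tensor) -> torch.Tensor:  # U: (B, D, H', W')
        F_mix = self.pointwise(self.depthwise(U))             # mixed features F
        act = U.abs().mean(dim=1, keepdim=True)               # per-pixel activity (assumed statistic)
        thr = torch.quantile(act.flatten(1), self.q, dim=1).view(-1, 1, 1, 1)
        mask = (act >= thr).float()                           # activity mask at quantile q
        pooled = (U * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp_min(1.0)
        g = torch.sigmoid(self.mlp(pooled))[..., None, None]  # channel gate g: (B, D, 1, 1)
        return g * F_mix * (1.0 + U)                          # U+ = g ⊙ F ⊙ (1 + U)
```

Applied to the packed tensor from the previous sketch, `TIDEBlockSketch(d=T_in * C_s)(Z)` returns a tensor with the same shape as its input, so blocks can be stacked to form the core $\mathcal{C}(\cdot)$.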