
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting

Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegger

Abstract

In dynamic traffic environments, motion forecasting models must continuously estimate accurate future trajectories. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions as well as on single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark while maintaining minimal latency, highlighting its suitability for real-world deployment.


Paper Structure

This paper contains 43 sections, 7 figures, and 18 tables.

Figures (7)

  • Figure 1: In real-world driving scenes, the available context history for different agents is heterogeneous. Therefore, forecasting models should be able to make accurate predictions for both long-term (brown car) and short-term (green car) contexts, where agents have been observed for a long time or have recently entered the field of view. While existing models often struggle with such varying context lengths, our SHARP is explicitly designed to provide accurate forecasts in such dynamically evolving scenes. Dotted lines indicate the available observations, while transparent paths show the corresponding future trajectory.
  • Figure 2: To jointly incorporate newly detected agents and long-term agent histories when forecasting motions in evolving scenes, we leverage a streaming-based motion forecasting model. The example at time step $t-1$ illustrates the standard model pass without streaming context, consisting of separate agent and lane encoders ($f_A$ and $f_L$), a scene encoder ($f_S$), and a trajectory decoder ($f_D$) with a streaming refinement module ($f_R$). At the next time step $t$, we integrate the previous scene context $S_{\text{enc}}^{t-1}$ via an instance-aware context streamer ($f_{\text{IA}}$) and generate auxiliary target-centric features $C^t$ by aggregating scene elements close to the endpoints of the previous predictions $F^{t-1}$. To improve robustness, we additionally perform a parallel model pass without the streaming modules during training only, producing $F^t_{\text{chunk}}$, which is used to compute the $\mathcal{L}_\text{chunk}$ objective.
  • Figure 3: Comparison of different evaluation setups on a motion forecasting dataset with $H_t$ as the historical context and $F_t$ as the future for the prediction task. The standard benchmark evaluation only considers a single history/future split per scenario (first row). In our experiments, we test the models at different time steps $t_p^i$ into the scenario, with varying context lengths $T^i_{\text{cl}}$ as model input, and evaluate different future horizons $T^i_f$.
  • Figure 4: Comparison of motion forecasting processing schemes: standard snapshot-based forecasting, the streaming-based forecasting used by RealMotion [song2024realmotion] and DeMo [zhang2024demo] (long, overlapping observation windows), and our proposed approach (compact, subsequent observation windows).
  • Figure 5: Details for evaluating related work on the evolving scene setting (main paper Table 1). The first example shows the standard execution on the AV2 benchmark. In the second example, we increase the input context by simply executing another streaming step. In the last example, we test the models with less context by evaluating after two streaming steps.
  • ...and 2 more figures
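
The streaming inference procedure described in the Figure 2 caption can be sketched as follows. The module names ($f_A$, $f_L$, $f_S$, $f_{\text{IA}}$, $f_D$, $f_R$) follow the caption's notation, but every signature and implementation detail here is a placeholder of our own, not the authors' code:

```python
def streaming_step(agents_t, lanes_t, prev_scene_enc, prev_forecasts,
                   f_A, f_L, f_S, f_IA, f_D, f_R):
    """One forecasting step at time t over a compact observation window.

    Hypothetical sketch: f_A/f_L encode agents and lanes, f_IA streams the
    previous scene context into the current agent features, f_S encodes the
    scene, f_D decodes trajectories, and f_R refines them using context
    around the previous predictions' endpoints.
    """
    # Encode the new short observation window of agents and lanes.
    agent_feats = f_A(agents_t)
    lane_feats = f_L(lanes_t)

    # Instance-aware context streaming: fuse the previous scene encoding
    # into the current agent features (skipped on the very first step).
    if prev_scene_enc is not None:
        agent_feats = f_IA(agent_feats, prev_scene_enc)

    # Joint scene encoding over agents and lanes.
    scene_enc = f_S(agent_feats, lane_feats)

    # Decode trajectories; refine them with target-centric context
    # aggregated around the endpoints of the previous predictions.
    forecasts = f_D(scene_enc)
    if prev_forecasts is not None:
        forecasts = f_R(forecasts, prev_forecasts)

    return forecasts, scene_enc
```

Carrying the returned pair `(forecasts, scene_enc)` into the next call is what makes inference streaming: each step only processes a compact new window while the latent scene context persists across steps.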