Motion Forecasting in Continuous Driving

Nan Song, Bozhou Zhang, Xiatian Zhu, Li Zhang

TL;DR

RealMotion targets motion forecasting in continuous driving by addressing the mismatch between practical streaming scenarios and traditional per-scene forecasting. It introduces a two-stream, encoder–decoder architecture: a scene context stream that accumulates historical context via cross-attention and a separate agent trajectory stream that refines predictions using a memory bank of past trajectories and Trajectory Embedding. A data reorganization strategy reshapes existing benchmarks into continuous sub-scenes to better emulate real-world driving. Empirical results on Argoverse datasets show state-of-the-art performance with efficient online inference, and ablations confirm the complementary benefits of the data reorganization and the dual-stream design. Overall, the work provides a practical framework for temporally coherent, multimodal motion forecasting in autonomous driving, with clear guidance on deployment considerations and limitations.

Abstract

Motion forecasting for agents in autonomous driving is highly challenging because each agent has many possible next actions and agents interact in complex ways across space and time. In real applications, motion forecasting takes place repeatedly and continuously as the self-driving car moves. However, existing forecasting methods typically process each driving scene within a certain range independently, ignoring the situational and contextual relationships between successive driving scenes. This significantly simplifies the forecasting task, making the solutions suboptimal and inefficient in practice. To address this fundamental limitation, we propose a novel motion forecasting framework for continuous driving, named RealMotion. It comprises two integral streams, both at the scene level: (1) the scene context stream progressively accumulates historical scene information up to the present moment, capturing temporal interactive relationships among scene elements; (2) the agent trajectory stream optimizes current forecasting by sequentially relaying past predictions. In addition, a data reorganization strategy, consistent with our network, is introduced to narrow the gap between existing benchmarks and real-world applications. Together, these approaches exploit more broadly the situational and progressive insights of dynamic motion across space and time. Extensive experiments on the Argoverse series under different settings demonstrate that RealMotion achieves state-of-the-art performance, along with the advantage of efficient real-world inference. The source code will be available at https://github.com/fudan-zvg/RealMotion.
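
The data reorganization idea mentioned in the abstract (turning one independent scene into several continuous sub-scenes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `chunk_trajectory`, the split points, and the history and future lengths are all hypothetical values chosen for illustration, and the aggregation of surrounding scene elements is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[List[Point], List[Point]]  # (history, future)


def chunk_trajectory(
    trajectory: Sequence[Point],
    split_points: Sequence[int],
    history_len: int,
    future_len: int,
) -> List[Segment]:
    """Cut one long trajectory into overlapping (history, future) segments.

    Each split point t plays the role of the "present" of one sub-scene:
    the history is the `history_len` steps before t and the future is the
    `future_len` steps starting at t. All parameters are illustrative
    assumptions, not values taken from the paper.
    """
    segments: List[Segment] = []
    for t in split_points:
        if t - history_len < 0 or t + future_len > len(trajectory):
            raise ValueError(f"split point {t} does not fit in the trajectory")
        history = list(trajectory[t - history_len:t])
        future = list(trajectory[t:t + future_len])
        segments.append((history, future))
    return segments


# Toy trajectory: 110 steps moving along the x-axis.
trajectory = [(float(i), 0.0) for i in range(110)]

# Three consecutive sub-scenes whose "present" advances by 10 steps each time.
sub_scenes = chunk_trajectory(
    trajectory, split_points=[30, 40, 50], history_len=30, future_len=60
)
```

Processing `sub_scenes` in order gives a model a short stream of successive scenes drawn from a single benchmark sample, which is the setting the two streams are described as operating in.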

Paper Structure (35 sections, 5 equations, 7 figures, 9 tables)

This paper contains 35 sections, 5 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of (a) existing methods, which process each scene independently, and (b) our RealMotion, which recurrently collects historical information. (c) For example, RealMotion can perceive a pedestrian who is currently invisible and predict that the agent of interest will give way.
  • Figure 2: Illustration of our data reorganization strategy, which processes (a) a given independent scene by (b) chunking the trajectories into segments and aggregating surrounding elements to generate (c) continuous sub-scenes.
  • Figure 3: Overview of our RealMotion architecture. RealMotion adopts an encoder-decoder structure with two intermediate streams designed to capture interactive relationships within each scene and across continuous scenes. The (a) scene context stream and (b) agent trajectory stream iteratively accumulate information for the scene context and rectify the prediction, respectively. The (c) context referencing and (d) trajectory relaying modules are specially designed cross-attention mechanisms for each stream.
  • Figure 4: Qualitative results on the Argoverse 2 validation set. Panels (a)-(c) show the progressive forecasting results of our RealMotion, where panel (c) shows the final predictions used for evaluation. Panel (d) shows the one-shot forecasting of RealMotion-I.
  • Figure 5: Failure cases. In the first row, the model fails to predict the turning behavior at complex intersections, while in the second row, it fails to predict the parking behavior.
  • ...and 2 more figures