Future-Aware Interaction Network For Motion Forecasting

Shijie Li, Xun Xu, Si Yong Yeo, Xulei Yang

TL;DR

FINet addresses the multimodal motion forecasting problem by integrating potential future trajectories into scene encoding, enabling joint optimization of historical and future states. It combines a Lightweight Scene Encoder, Future-Aware Mamba with an Adaptive Reorder Strategy, and a Temporal Enhanced Decoder to produce diverse, temporally coherent trajectories with linear-scaling efficiency via the Mamba State Space Model. The method achieves state-of-the-art or competitive results on Argoverse 1 and 2, with substantial reductions in latency, memory, and FLOPs compared to transformer-based baselines. This work advances practical, real-time motion forecasting for autonomous driving by improving accuracy and efficiency while capturing diverse plausible futures.

Abstract

Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent from the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, and therefore suffer from the transformer's quadratic scaling complexity. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements improve not only model efficiency but also the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method outperforms previous approaches while being more efficient. The code will be released upon acceptance.
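
The two ideas the abstract relies on, turning an unordered set into a sequence and then processing that sequence in a single linear-time pass, can be illustrated with a toy sketch. This is not the paper's method: the abstract does not specify how the adaptive reordering is computed, so the per-token `scores` below are a hypothetical stand-in, and `linear_scan` is a fixed scalar recurrence rather than Mamba's selective SSM, whose parameters depend on the input. The function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def reorder_by_score(
    tokens: Sequence[str], scores: Sequence[float]
) -> Tuple[List[str], List[int]]:
    """Turn an unordered set of tokens into a sequence by sorting on a score.

    Assumption: `scores` stands in for whatever ordering signal a model
    would produce; the paper's actual strategy is not given in the abstract.
    """
    # sorted() is stable, so tokens with equal scores keep their input order.
    order = sorted(range(len(tokens)), key=lambda i: scores[i])
    return [tokens[i] for i in order], order


def linear_scan(xs: Sequence[float], a: float, b: float) -> List[float]:
    """Run the recurrence h_t = a * h_{t-1} + b * x_t with h_0 = 0.

    One pass over the sequence, so the cost grows linearly with its length,
    in contrast to the pairwise (quadratic) cost of full self-attention.
    """
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states


# Hypothetical scene tokens with made-up scores.
ordered, order = reorder_by_score(["agent", "lane", "future"], [0.9, 0.1, 0.5])
states = linear_scan([1.0, 1.0, 1.0], a=0.5, b=1.0)
```

Because a recurrence of this kind is order-sensitive, the position each token receives determines what state it sees, which is why an unordered set of scene elements has to be given some ordering before a sequence model can consume it.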

Paper Structure

This paper contains 17 sections, 25 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Query-based
  • Figure 2: Intention-based
  • Figure 4: Previous methods can be categorized into MLP-based or Query-based approaches, where future trajectories are absent during the scene encoding stage. In contrast, we propose an Interaction-based method that models future trajectories in advance and seamlessly integrates them into scene encoding, enabling a more comprehensive representation.
  • Figure 5: The proposed Future-Aware Interaction Mamba (FIM). Potential future trajectories are modeled in advance and then integrated into the scene encoding. Making the model aware of future states allows it to learn a more comprehensive representation.
  • Figure 6: The proposed Temporal Enhanced Decoder (TEDec). For simplicity, we omit the reshape operations but indicate the tensor shape at each step.
  • ...and 1 more figure