HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

Xiaodong Mei, Sheng Wang, Jie Cheng, Yingbing Chen, Dan Xu

TL;DR

HAMF addresses multi-modal motion forecasting by unifying scene-context understanding and future motion representation learning within a single encoder. It injects learnable future-motion tokens into the scene encoding and pairs an Attention-based encoder with a Mamba-based decoder to capture long-range dependencies and inter-token relationships. The approach achieves state-of-the-art results on the Argoverse 2 benchmark with a lightweight model (≈3.0M parameters) and real-time inference, and extensive ablations highlight the benefit of token-based interaction and sequential modeling. The design offers a practical, scalable solution for autonomous driving systems that need accurate and diverse trajectory predictions without sacrificing efficiency.

Abstract

Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. Existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, but they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. We then design a unified Attention-based encoder, which combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we apply a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
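
To make the joint encoding idea concrete, here is a minimal sketch in plain Python. Only the concatenation of scene tokens with learnable future motion tokens comes from the text above. The exact arrangement of self-attention and cross-attention in HAMF is not given here, so the sketch assumes plain single-head self-attention over the joint sequence, with identity projections and a residual connection. The names `encode_jointly`, `self_attention`, and all sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Identity query/key/value projections keep the sketch short (an assumption);
    a trained model would use learned projection matrices instead.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return attended


def encode_jointly(scene_tokens, future_tokens, num_layers=2):
    """Run attention over [scene tokens ; future motion tokens] as one sequence."""
    num_scene = len(scene_tokens)
    joint = scene_tokens + future_tokens  # concatenate along the token axis
    for _ in range(num_layers):
        attended = self_attention(joint)
        # Residual connection: each token keeps its own features plus the context.
        joint = [[x + a for x, a in zip(tok, att)] for tok, att in zip(joint, attended)]
    return joint[:num_scene], joint[num_scene:]


rng = random.Random(0)
DIM, NUM_SCENE, NUM_MODES = 4, 5, 6  # hypothetical sizes, not from the paper
scene = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_SCENE)]
future = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_MODES)]

scene_out, future_out = encode_jointly(scene, future)
print(len(scene_out), len(future_out), len(future_out[0]))  # 5 6 4
```

Because the future motion tokens sit in the same sequence as the scene tokens, every encoder layer lets them read agent and map context directly, rather than only after the scene has been compressed into a fixed feature.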

Paper Structure

This paper contains 25 sections, 6 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Comparison of existing motion forecasting frameworks (a, b) and our proposed method (c). The purple bounding boxes denote the surrounding traffic participants and the yellow box denotes the focal agent. Historical trajectories are drawn as gradient blue lines, and the ground-truth future trajectory, which is the prediction target, is drawn as a gradient pink line. Unlike direct-decoding approaches and learnable anchor-based methods, we formulate the future motion features as a set of learnable tokens and feed them into the encoder together with the embedded scene context tokens, so that more comprehensive future motion representations are acquired during the scene understanding stage.
  • Figure 2: Overview of our proposed HAMF. The left part presents the input embedding module with an intersection driving scenario. The ground-truth future trajectory is shown as a gradient pink line for illustration only and is not used as input. The historical trajectories and surrounding map are embedded and combined as the initial scene tokens $S^0$, then concatenated with the initial future motion tokens $F^0$ to form the input of the unified encoder. The middle part shows the encoding process within the $l$-th encoder layer. The dashed lines represent the input to the subsequent encoding layer. After $L_{enc}$ encoder layers, the learned future motion tokens $F^{L_{enc}}$ are obtained and decoded with the Mamba block and multi-layer MLPs to generate the final prediction, shown in the right part.
  • Figure 3: Qualitative results of our proposed method, the query-based method $\mathcal{M}_q$, and the base model $\mathcal{M}_b$ on four challenging scenarios from the AV2 validation set. Surrounding agents are represented by purple bounding boxes and the focal agent by a yellow box. The gradient pink line indicates the ground truth and the deep blue lines indicate the multi-modal predicted trajectories.
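
The decoding path in the Figure 2 caption (learned future motion tokens, then a sequence model, then MLP heads) can be sketched as below. This is not the Mamba layer itself: Mamba uses input-dependent state-space parameters, whereas this sketch assumes a fixed scalar linear recurrence as a stand-in, and a single linear layer in place of the multi-layer MLPs. The function names, `decay`, `gain`, and all sizes are hypothetical.

```python
import random


def linear_scan(tokens, decay=0.9, gain=0.1):
    """Mix a token sequence with a fixed linear recurrence.

    h_k = decay * h_{k-1} + gain * x_k,   y_k = x_k + h_k
    Fixed scalars are an assumption made for brevity; they only show how
    information flows along the sequence of future motion tokens.
    """
    state = [0.0] * len(tokens[0])
    mixed = []
    for token in tokens:
        state = [decay * s + gain * x for s, x in zip(state, token)]
        mixed.append([x + s for x, s in zip(token, state)])
    return mixed


def decode_trajectory(token, weights, horizon):
    """Map one motion token to `horizon` (x, y) waypoints with a linear head."""
    flat = [sum(w * x for w, x in zip(row, token)) for row in weights]
    return [(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)]


rng = random.Random(0)
DIM, NUM_MODES, HORIZON = 4, 6, 10  # hypothetical sizes, not from the paper
future_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_MODES)]
head = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2 * HORIZON)]

mixed = linear_scan(future_tokens)
trajectories = [decode_trajectory(tok, head, HORIZON) for tok in mixed]
print(len(trajectories), len(trajectories[0]))  # 6 10
```

Each of the `NUM_MODES` tokens yields one candidate trajectory, and the recurrence lets later tokens depend on earlier ones, which is one simple way to express correlation among the predicted modes.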