HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

Xiaodong Mei; Sheng Wang; Jie Cheng; Yingbing Chen; Dan Xu

TL;DR

HAMF addresses multi-modal motion forecasting by unifying scene-context understanding and future motion representation learning within a single encoder. It injects learnable future-motion tokens into the scene encoding and pairs an Attention-based encoder with a Mamba-based decoder to capture long-range dependencies and inter-token relationships. The approach achieves state-of-the-art results on the Argoverse 2 benchmark with a lightweight model (≈3.0M parameters) and real-time inference, and extensive ablations highlight the benefit of token-based interaction and sequential modeling. The design targets autonomous driving systems that need accurate and diverse trajectory predictions without sacrificing efficiency.

Abstract

Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. Existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, but they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features represented as a set of learnable tokens. We then design a unified Attention-based encoder that combines self-attention and cross-attention mechanisms to model the scene context and aggregate future motion features jointly. Complementing the encoder, we apply a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
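To make the token-based encoding concrete, the following is a minimal sketch of one plausible reading of the encoder step described above: agent and map tokens form a single scene sequence that is refined by self-attention, and a set of future-motion mode tokens then gathers scene context through cross-attention. It is not the authors' implementation. The function names (`attend`, `encoder_layer`, `random_tokens`), the token counts, and the feature width are all hypothetical. Learned projections, normalization, feed-forward layers, and the Mamba decoder are omitted, and randomly initialized vectors stand in for the learnable tokens.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention.

    Assumption: no learned Q/K/V projections, so `memory` serves as
    both keys and values. A real model would project each role.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, memory)) for j in range(dim)]
        )
    return outputs


def add_residual(tokens, updates):
    """Element-wise residual connection between two token lists."""
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, updates)]


def encoder_layer(scene_tokens, mode_tokens):
    """One hypothetical encoder layer.

    1. Self-attention: scene tokens (agents + map) attend to each other.
    2. Cross-attention: future-motion mode tokens query the updated scene.
    """
    scene_tokens = add_residual(scene_tokens, attend(scene_tokens, scene_tokens))
    mode_tokens = add_residual(mode_tokens, attend(mode_tokens, scene_tokens))
    return scene_tokens, mode_tokens


def random_tokens(rng, count, dim):
    """Stand-in for embedded inputs or learnable tokens (hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8                                  # assumed feature width
    agents = random_tokens(rng, 4, dim)      # embedded agent histories
    lanes = random_tokens(rng, 6, dim)       # embedded map elements
    modes = random_tokens(rng, 3, dim)       # one token per future mode
    scene, modes = encoder_layer(agents + lanes, modes)
    print(len(scene), len(modes), len(modes[0]))
```

In this sketch each mode token leaves the layer with the same width as the scene tokens, so a downstream sequential decoder (a Mamba module, in the paper's design) could process the mode tokens as an ordered sequence before trajectories are regressed.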
