Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM

Yizhou Huang, Yihua Cheng, Kezhi Wang

TL;DR

Trajectory Mamba introduces a selective state-space model to replace conventional self-attention in a three-encoder–decoder framework for motion prediction. By coupling joint polyline encoding with Cross-Tamba decoding and an RNN-based trajectory weighting module, the approach achieves near-linear efficiency while maintaining strong accuracy. Key contributions include the selective input-dependent SSM attention, the joint encoding of pedestrians and traffic signals, and a cross-state-space decoder that shares a unified scene representation across targets. Empirical results on Argoverse 1 and 2 show a four-fold FLOPs reduction and over 40% fewer parameters, with competitive or superior accuracy compared to prior SOTA methods, highlighting strong potential for real-time autonomous driving deployment.

Abstract

Motion prediction is crucial for autonomous driving, as it enables accurate forecasting of future vehicle trajectories based on historical inputs. This paper introduces Trajectory Mamba, a novel, efficient trajectory prediction framework based on the selective state-space model (SSM). Conventional attention-based models face computational costs that grow quadratically with the number of targets, hindering their application in highly dynamic environments. In response, we leverage the SSM to redesign the self-attention mechanism in the encoder-decoder architecture, thereby achieving linear time complexity. To address the potential reduction in prediction accuracy resulting from modifications to the attention mechanism, we propose a joint polyline encoding strategy that better captures the associations between static and dynamic contexts, ultimately enhancing prediction accuracy. Additionally, to balance prediction accuracy and inference speed, we adopt a decoder that differs entirely from the encoder. Through cross-state-space attention, all target agents share the scene context, allowing the SSM to interact with the shared scene representation during decoding and thus infer different trajectories over the next prediction steps. Our model achieves state-of-the-art results in terms of inference speed and parameter efficiency on both the Argoverse 1 and Argoverse 2 datasets. It demonstrates a four-fold reduction in FLOPs compared to existing methods and reduces the parameter count by over 40% while surpassing the performance of the vast majority of previous methods. These findings validate the effectiveness of Trajectory Mamba in trajectory prediction tasks.
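To make the linear-time claim concrete, the following is a minimal sketch of a generic Mamba-style selective scan, not the paper's Tamba block. It assumes a single input channel, a diagonal state matrix, and scalar projection weights; the names `selective_scan`, `w_delta`, `w_b`, and `w_c` are hypothetical and chosen only for illustration. The point it shows is that the step size and the input and output maps depend on the current token, while the whole sequence is processed in one pass, so the work grows linearly with sequence length rather than quadratically as in pairwise self-attention.

```python
import math


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Run a single-channel selective SSM over the sequence xs in one pass.

    Illustrative assumptions (not taken from the paper):
      a        -- list of N negative reals, the diagonal of the state matrix A
      w_delta  -- scalar weight producing the input-dependent step size
      w_b, w_c -- lists of N weights producing input-dependent B_t and C_t
    """
    n = len(a)
    h = [0.0] * n  # hidden state, carried from one token to the next
    ys = []
    for x in xs:  # one step per token: O(L * N) work in total
        # Selection: the parameters are functions of the current input.
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps delta > 0
        b = [w * x for w in w_b]
        c = [w * x for w in w_c]
        for i in range(n):
            a_bar = math.exp(delta * a[i])  # zero-order hold for diagonal A
            b_bar = delta * b[i]            # simplified discretization of B
            h[i] = a_bar * h[i] + b_bar * x
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


if __name__ == "__main__":
    out = selective_scan([0.5, -1.0, 2.0, 0.0], a=[-1.0, -0.5],
                         w_delta=1.0, w_b=[0.3, 0.7], w_c=[1.0, -1.0])
    print(len(out))  # one output per input token
```

Because each token touches the state only once, doubling the sequence length doubles the work; a pairwise attention matrix over the same tokens would quadruple it.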

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of the proposed joint polyline encoding strategy, in which we consider all factors that affect the movement of motor vehicles, jointly encoding pedestrians and traffic lights rather than categorizing the scene horizontally into static and dynamic context. Additionally, we decompose and encode the interactions between all agents and elements at each time step.
  • Figure 2: Overview of our Tamba encoder. We employ the joint polyline encoding strategy to integrate strongly correlated polyline information and use three parallel encoders to interact with and associate these features. We apply a linear projection to map the attention output of Tamba back to the same dimensions as the input and use normalization to enhance the stability of the output.
  • Figure 3: Overview of our Tamba decoding process. The outputs of the three Atten-Tamba encoders carry the context of both the static scene and dynamic agent states, which are concatenated and input into the Cross-Tamba decoder. For predicting the target agent's trajectory, the decoder queries $K$ trajectory modes using independent query tensors, while the encoder's features combine the current state with recursive reasoning from the previous step, allowing all target agents to share a unified scene representation. During the secondary decoding, the predicted state of the proposed trajectory interacts once again with the current scene information, and a recurrent network assigns prediction weights to the proposed trajectories. The final output trajectory is obtained through refinement of these predictions.
  • Figure 4: Qualitative results on the Argoverse 2 validation set. The target agent is marked in orange and its surrounding agents in cold white. We show four different traffic scenarios: (a) multiple agents on a straight road, (b) multiple agents on a roundabout, (c) vehicle avoidance after a roundabout, (d) a mixed scenario involving pedestrians and vehicles.