FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration

Yizhou Huang, Gengze Jiang, Yihua Cheng, Kezhi Wang

TL;DR

FoSS is a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling, and achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%.

Abstract

Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur complexity that grows quadratically with the number of agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. To address this trade-off, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, then refine the spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses the temporal and spectral representations, learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on the Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
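
The amplitude/phase split described in the abstract can be illustrated with a minimal sketch. It assumes each 2D point is encoded as the complex number x + iy and uses a naive O(T^2) DFT for readability; the function names (`dft`, `amplitude_phase`, `reconstruct`) and the toy trajectory are hypothetical, and none of this is taken from the authors' implementation.

```python
import cmath
import math


def dft(seq):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(seq))
        for k in range(n)
    ]


def amplitude_phase(traj):
    """Split a 2D trajectory [(x, y), ...] into per-frequency amplitude and phase.

    Encoding each point as x + iy is an illustrative assumption,
    not necessarily the encoding used in the paper.
    """
    spectrum = dft([complex(x, y) for x, y in traj])
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def reconstruct(amplitude, phase):
    """Inverse DFT from amplitude and phase; shows the split loses no information."""
    n = len(amplitude)
    spectrum = [a * cmath.exp(1j * p) for a, p in zip(amplitude, phase)]
    points = [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]
    return [(z.real, z.imag) for z in points]


# Toy trajectory: steady forward motion with a small lateral wobble.
traj = [(float(t), 0.1 * math.sin(2 * math.pi * t / 4)) for t in range(8)]
amplitude, phase = amplitude_phase(traj)
recovered = reconstruct(amplitude, phase)
```

Here `amplitude[0]` is the magnitude of the summed positions (the zero-frequency term), while the higher-index bins carry the wobble. Recombining amplitude and phase recovers the original points up to floating-point error, so the two components together are a lossless re-encoding of the trajectory.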

Paper Structure (18 sections, 13 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The relationship between input modalities, classical Fourier decomposition, and our HelixSort progressive scanning. Low-frequency inputs such as traffic rules and maps encode global priors, while high-frequency inputs such as vehicle status and position contain fine-grained dynamics. HelixSort progressively scans and aligns these heterogeneous frequency cues, forming an ordered frequency sequence for selective state-space modeling.
  • Figure 2: Overview of the proposed framework. Historical trajectories are processed in two parallel branches: (i) the frequency branch (FD-Mamba) applies a DFT, reorders the spectrum with a progressive HelixSort, and passes it through two selective state-space blocks, Coarse2Fine-SSM for spatial interaction and SpecEvolve-SSM for channel evolution; (ii) the time branch (TD-Mamba) passes the raw sequence through an input-dependent selective SSM in which each state-transition matrix is dynamically generated from the current observation and its local Conv1D features. This design allows the temporal dynamics to adapt over time, mimicking self-attention behavior with linear complexity. The resulting features are fused via a cross-attention layer, after which a learnable query set decodes $K$ candidate futures whose uncertainty-aware weights yield the final trajectory prediction.
  • Figure 3: Qualitative results on Argoverse 2, covering U-turn (e), lane change (b,f), turning (a,c,h), and straight-driving scenes (d,g). Red: past trajectories; orange: ground truth; green: predicted trajectories. The model generates smooth and diverse trajectories aligned with road geometry, with slight jitter (f) in frequent lane-change cases, possibly due to transient high-frequency motion cues.
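
The input-dependent selective SSM described for the time branch in Figure 2 can be sketched as a single linear-time scan. This is a simplified, hypothetical illustration, not the authors' TD-Mamba: it assumes a scalar input per step, a diagonal state, and a step size `delta` computed from the current observation alone, and it omits the Conv1D features. All parameter names are assumptions.

```python
import math


def softplus(v):
    """Numerically stable softplus, keeps the step size positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan(xs, decay_rates, w_delta, b_delta, w_in, w_out):
    """One pass over the sequence, O(T) in its length.

    Per step (diagonal state, scalar input):
        delta_t = softplus(w_delta * x_t + b_delta)      # input-dependent step size
        h_t     = exp(-delta_t * decay) * h_{t-1} + delta_t * w_in * x_t
        y_t     = sum(w_out * h_t)
    Because delta_t depends on x_t, the state transition changes at every step.
    """
    h = [0.0] * len(decay_rates)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)
        h = [
            math.exp(-delta * lam) * h_i + delta * b_i * x
            for lam, b_i, h_i in zip(decay_rates, w_in, h)
        ]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(w_out, h)))
    return ys


# Toy usage: a short 1D signal and a two-dimensional hidden state.
signal = [0.0, 1.0, 0.5, -0.2, 0.0]
outputs = selective_scan(
    signal,
    decay_rates=[0.5, 2.0],
    w_delta=1.0,
    b_delta=0.0,
    w_in=[1.0, 1.0],
    w_out=[0.7, 0.3],
)
```

A large `delta` makes the state forget its history quickly and weight the current input more heavily, while a small `delta` preserves earlier context. This per-step gating is what lets a recurrent scan approximate content-dependent mixing without the quadratic cost of attention.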