Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer Network for Trajectory Prediction

Weizheng Wang, Baijian Yang, Sungeun Hong, Wenhai Sun, Byung-Cheol Min

TL;DR

Hyper-STTN tackles crowd trajectory prediction by jointly modeling higher-order groupwise and pairwise social interactions. It integrates multiscale hypergraphs with a spatial-temporal transformer and fuses heterogeneous features via a multimodal transformer, followed by a CVAE decoder to handle stochasticity. The approach achieves state-of-the-art or competitive results on ETH-UCY and NBA datasets, with ablations confirming the contributions of both the hypergraph network and the transformer components. This framework advances trajectory forecasting in dense crowds and offers a scalable path toward real-time, multimodal-informed predictions in robotic and autonomous systems.

Abstract

Predicting crowd intentions and trajectories is critical for a range of real-world applications, including social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs multiscale hypergraphs of varying group sizes to model groupwise correlations, which are captured through spectral hypergraph convolution based on random-walk probabilities. In parallel, a spatial-temporal transformer learns pedestrians' pairwise latent interactions across spatial and temporal dimensions. These heterogeneous groupwise and pairwise features are subsequently fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models.
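
To make the groupwise branch more concrete, the following is a minimal sketch of multiscale hypergraph construction followed by one aggregation step. It is not the authors' implementation: it assumes hyperedges are built by KNN on raw 2D positions (the paper queries agents in a learned feature space), uses unweighted hyperedges with no learnable parameters, and replaces the spectral convolution with plain mean aggregation, which corresponds to one step of a random walk on the hypergraph. The function names `knn_hyperedges`, `multiscale_hyperedges`, and `hypergraph_step` are hypothetical.

```python
import math


def knn_hyperedges(points, k):
    """Build one hyperedge per agent: the agent plus its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        edges.append(sorted([i] + others[:k]))
    return edges


def multiscale_hyperedges(points, scales):
    """Map each scale k to its own hyperedge set (one hypergraph per group size)."""
    return {k: knn_hyperedges(points, k) for k in scales}


def hypergraph_step(features, edges):
    """One unweighted aggregation step: point-to-edge mean, then edge-to-point mean.

    Assumption: this mean aggregation stands in for the learned spectral
    hypergraph convolution described in the abstract.
    """
    dim = len(features[0])
    # Point-to-edge: each hyperedge averages the features of its member agents.
    edge_feats = [
        [sum(features[v][d] for v in e) / len(e) for d in range(dim)]
        for e in edges
    ]
    # Edge-to-point: each agent averages the features of its incident hyperedges.
    out = []
    for v in range(len(features)):
        incident = [ef for e, ef in zip(edges, edge_feats) if v in e]
        out.append([sum(ef[d] for ef in incident) / len(incident) for d in range(dim)])
    return out


# Two loose groups of agents; positions double as input features here.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
graphs = multiscale_hyperedges(positions, scales=[1, 2])
smoothed = {k: hypergraph_step([list(p) for p in positions], e) for k, e in graphs.items()}
```

Because every agent belongs to at least its own hyperedge, each row of the implied transition matrix sums to one, so constant features pass through unchanged. Small `k` keeps aggregation within tight groups, while larger `k` mixes information across wider groups.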

Paper Structure

This paper contains 23 sections, 18 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: HHI (human-human interaction) feature illustration: groupwise HHI captures latent correlations in high-level group behaviors, while pairwise spatial-temporal HHI represents influences between individuals.
  • Figure 2: Hyper-STTN neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolution network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer uses multi-head attention layers to capture each individual agent's long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-transformer block to abstract the uncertainty of multimodal crowd movements.
  • Figure 3: Groupwise HHI representation: (i) groupwise HHI is constructed as a set of multiscale hypergraphs, where each agent is queried in the feature space with varying k in KNN to link multiscale hyperedges; (ii) after the HHI hypergraphs are constructed, groupwise dependencies are captured by point-to-edge and edge-to-point phases with hypergraph spectral convolution operations.
  • Figure 4: Hybrid spatial-temporal transformer framework: pedestrians' motion intents and dependencies are abstracted as spatial and temporal attention maps by the multi-head attention mechanism of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align heterogeneous groupwise and pairwise features.
  • Figure 5: Comparison of trajectory visualizations: trajectories visualized for Hyper-STTN and other algorithms on the same test case.
  • ...and 1 more figures
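
The cross-attention alignment named in the Figure 2 and Figure 4 captions can be illustrated with a small sketch. This is an assumption-laden simplification rather than the paper's block: it uses a single head, omits the learned query, key, and value projections, and treats pairwise features as queries and groupwise features as keys and values. The names `softmax` and `cross_attention` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned projections.

    Each query row (assumed here: a pairwise feature) attends over all key rows
    (assumed here: groupwise features) and returns a weighted sum of value rows.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy features: two pairwise tokens attend over three groupwise tokens.
pairwise = [[1.0, 0.0], [0.0, 1.0]]
groupwise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(pairwise, groupwise, groupwise)
```

Each row of `attn` is a probability distribution over groupwise tokens, so `fused` has one row per pairwise token expressed as a convex combination of groupwise features. A multi-head version would repeat this with separate learned projections per head and concatenate the results.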