Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer Network for Trajectory Prediction
Weizheng Wang, Baijian Yang, Sungeun Hong, Wenhai Sun, Byung-Cheol Min
TL;DR
Hyper-STTN tackles crowd trajectory prediction by jointly modeling higher-order groupwise and pairwise social interactions. It integrates multiscale hypergraphs with a spatial-temporal transformer and fuses heterogeneous features via a multimodal transformer, followed by a CVAE decoder to handle stochasticity. The approach achieves state-of-the-art or competitive results on ETH-UCY and NBA datasets, with ablations confirming the contributions of both the hypergraph network and the transformer components. This framework advances trajectory forecasting in dense crowds and offers a path toward real-time trajectory prediction in robotic and autonomous systems.
Abstract
Predicting crowd intentions and trajectories is critical for a range of real-world applications, including social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs multiscale hypergraphs of varying group sizes to model groupwise correlations, captured through spectral hypergraph convolution based on random-walk probabilities. In parallel, a spatial-temporal transformer is employed to learn pedestrians' pairwise latent interactions across spatial and temporal dimensions. These heterogeneous groupwise and pairwise features are subsequently fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models.
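As a rough illustration of the groupwise branch described in the abstract, the sketch below builds hypergraphs at several group sizes and applies one random-walk-based hypergraph convolution per scale. It is a minimal sketch under stated assumptions, not the authors' implementation: the nearest-neighbor grouping rule, the unit hyperedge weights, the ReLU activation, and the names `knn_incidence`, `random_walk_matrix`, and `hypergraph_conv` are all hypothetical choices made for this example. The transition matrix used is the standard hypergraph random walk P = D_v^{-1} H W D_e^{-1} H^T, where H is the vertex-by-hyperedge incidence matrix.

```python
import numpy as np


def knn_incidence(positions, group_size):
    """Assumed grouping rule: one hyperedge per agent, containing the agent
    and its (group_size - 1) nearest neighbors. Returns H of shape (N, N)."""
    n = positions.shape[0]
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(dists[e])[:group_size]  # agent e itself comes first
        H[members, e] = 1.0
    return H


def random_walk_matrix(H, edge_weights=None):
    """Hypergraph random-walk transition matrix P = D_v^-1 H W D_e^-1 H^T."""
    w = np.ones(H.shape[1]) if edge_weights is None else edge_weights
    d_v = H @ w           # vertex degree: total weight of incident hyperedges
    d_e = H.sum(axis=0)   # hyperedge degree: number of member vertices
    return (H * (w / d_e)) @ H.T / d_v[:, None]


def hypergraph_conv(X, H, theta):
    """One convolution layer: propagate features along P, project, apply ReLU."""
    P = random_walk_matrix(H)
    return np.maximum(P @ X @ theta, 0.0)


# Usage: 6 agents with 2-D positions and 4-D features, two group scales.
rng = np.random.default_rng(0)
positions = rng.normal(size=(6, 2))
features = rng.normal(size=(6, 4))
theta = rng.normal(size=(4, 8))  # would be learned in practice

multiscale = [hypergraph_conv(features, knn_incidence(positions, k), theta)
              for k in (2, 3)]
group_features = np.concatenate(multiscale, axis=-1)  # shape (6, 16)
```

Each row of P sums to one, so every agent's updated feature is a weighted average over the agents it shares a group with. Concatenating the per-scale outputs is only one possible way to combine scales and is likewise an assumption here.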
