HeLoFusion: An Efficient and Scalable Encoder for Modeling Heterogeneous and Multi-Scale Interactions in Trajectory Prediction
Bingqing Wei, Lianmin Chen, Zhongyu Xia, Yongtao Wang
TL;DR
HeLoFusion introduces a locality-focused encoder that builds multi-scale local graphs around each agent to capture pairwise and group-wise interactions, while explicitly handling agent heterogeneity through aggregation-decomposition message passing and type-specific projections. Its three-stage architecture encodes motion, models localized heterogeneous interactions, and fuses the resulting context with map information to produce robust agent embeddings. On the Waymo Open Motion Dataset, it achieves state-of-the-art Soft mAP, mAP, and displacement metrics among comparable single-model, LiDAR-free methods, and it is more memory-efficient than global-context approaches. The work suggests that prioritizing nearby, type-aware interactions offers a scalable and efficient path to accurate motion forecasting for autonomous driving, with substantial gains and without prohibitive computational cost.
Abstract
Multi-agent trajectory prediction in autonomous driving requires a comprehensive understanding of complex social dynamics. Existing methods, however, often struggle to capture the full richness of these dynamics, particularly the co-existence of multi-scale interactions and the diverse behaviors of heterogeneous agents. To address these challenges, this paper introduces HeLoFusion, an efficient and scalable encoder for modeling heterogeneous and multi-scale agent interactions. Instead of relying on global context, HeLoFusion constructs local, multi-scale graphs centered on each agent, allowing it to effectively model both direct pairwise dependencies and complex group-wise interactions (e.g., platooning vehicles or pedestrian crowds). Furthermore, HeLoFusion tackles the critical challenge of agent heterogeneity through an aggregation-decomposition message-passing scheme and type-specific feature networks, enabling it to learn nuanced, type-dependent interaction patterns. This locality-focused approach enables a principled representation of multi-level social context, yielding powerful and expressive agent embeddings. On the challenging Waymo Open Motion Dataset, HeLoFusion achieves state-of-the-art performance, setting new benchmarks for key metrics including Soft mAP and minADE. Our work demonstrates that a locality-grounded architecture, which explicitly models multi-scale and heterogeneous interactions, is a highly effective strategy for advancing motion forecasting.
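To make the idea of local, multi-scale graphs centered on each agent more concrete, the following is a minimal sketch. It is not the paper's implementation: the abstract does not say how neighborhoods are chosen, so the distance-threshold rule, the function name `local_multiscale_neighbors`, the radii, and the toy positions are all assumptions made for illustration. It only shows how each agent can receive one neighbor set per spatial scale without ever consulting a global, all-to-all context.

```python
import math


def local_multiscale_neighbors(positions, radii):
    """Return, for each agent, one neighbor list per radius.

    Assumption: a neighbor at a given scale is any other agent whose
    Euclidean distance is at most that scale's radius. The paper may use
    a different rule (for example k-nearest neighbors).
    """
    graphs = []
    for i, (xi, yi) in enumerate(positions):
        per_scale = []
        for r in radii:
            # Collect every other agent that falls inside radius r.
            neighbors = [
                j
                for j, (xj, yj) in enumerate(positions)
                if j != i and math.hypot(xj - xi, yj - yi) <= r
            ]
            per_scale.append(neighbors)
        graphs.append(per_scale)
    return graphs


# Hypothetical scene: four agents on a line, positions in metres.
positions = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0), (50.0, 0.0)]
graphs = local_multiscale_neighbors(positions, radii=[5.0, 20.0])

for agent_id, per_scale in enumerate(graphs):
    print(agent_id, per_scale)
```

In this toy scene the distant agent at 50 m has no neighbors at either scale, so it would contribute nothing to the other agents' local graphs. That is the kind of saving a locality-based encoder relies on when compared with global attention over all agents.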
