Less is More: Strategic Expert Selection Outperforms Ensemble Complexity in Traffic Forecasting
Walid Guettala, Yufan Zhao, László Gulyás
TL;DR
This work tackles traffic forecasting by integrating explicit road-network topology into a mixture-of-experts framework. It introduces TESTAM+, which adds a SpatioSemantic Expert and a memory-based routing mechanism, enabling dynamic, topology-aware spatial modeling while maintaining parallel, non-autoregressive forecasting. Empirically, TESTAM+ achieves state-of-the-art MAE on METR-LA ($2.99$ vs. $3.38$ for MegaCRN) and PEMS-BAY ($1.63$ MAE for Ad/SS), with inference-latency reductions of $53.1\%$ to $61.7\%$ compared to full ensembles, demonstrating that fewer, carefully designed experts can outperform larger ensembles. The findings advocate for domain-aware expert design and efficient routing in MoE architectures to enable real-time deployment in complex urban networks.
Abstract
Traffic forecasting is fundamental to intelligent transportation systems, enabling congestion mitigation and emission reduction in increasingly complex urban environments. While recent graph neural network approaches have advanced spatial-temporal modeling, existing mixture-of-experts frameworks like Time-Enhanced Spatio-Temporal Attention Model (TESTAM) lack explicit incorporation of physical road network topology, limiting their spatial capabilities. We present TESTAM+, an enhanced spatio-temporal forecasting framework that introduces a novel SpatioSemantic Expert integrating physical road topology with data-driven feature similarity through hybrid graph construction. TESTAM+ achieves significant improvements over TESTAM: 1.3% MAE reduction on METR-LA (3.10 vs. 3.14) and 4.1% improvement on PEMS-BAY (1.65 vs. 1.72). Through comprehensive ablation studies, we discover that strategic expert selection fundamentally outperforms naive ensemble aggregation. Individual experts demonstrate remarkable effectiveness: the Adaptive Expert achieves 1.63 MAE on PEMS-BAY, outperforming the original three-expert TESTAM (1.72 MAE), while the SpatioSemantic Expert matches this performance with identical 1.63 MAE. The optimal Identity + Adaptive configuration achieves an 11.5% MAE reduction compared to state-of-the-art MegaCRN on METR-LA (2.99 vs. 3.38), while reducing inference latency by 53.1% compared to the full four-expert TESTAM+. Our findings reveal that fewer, strategically designed experts outperform complex multi-expert ensembles, establishing new state-of-the-art performance with superior computational efficiency for real-time deployment.
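The percentage improvements quoted in the abstract appear to be relative MAE reductions, computed as (baseline - improved) / baseline. The short sketch below reproduces them from the MAE pairs stated in the text. The helper name `relative_reduction` and the dictionary labels are illustrative assumptions, not taken from the paper or its code.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# (baseline MAE, improved MAE) pairs as quoted in the abstract.
# The labels are descriptive only (hypothetical naming).
comparisons = {
    "METR-LA, TESTAM vs. TESTAM+": (3.14, 3.10),
    "PEMS-BAY, TESTAM vs. TESTAM+": (1.72, 1.65),
    "METR-LA, MegaCRN vs. Identity + Adaptive": (3.38, 2.99),
}

for label, (baseline, improved) in comparisons.items():
    print(f"{label}: {relative_reduction(baseline, improved):.1f}% MAE reduction")
```

Under this definition, the three pairs give roughly 1.3%, 4.1%, and 11.5%, which agrees with the figures reported above.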
