Teal: Learning-Accelerated Optimization of WAN Traffic Engineering
Zhiying Xu, Francis Y. Yan, Rachee Singh, Justin T. Chiu, Alexander M. Rush, Minlan Yu
TL;DR
Teal addresses the scalability gap in WAN traffic engineering (TE) by combining a flow-centric graph neural network (FlowGNN) with per-demand multi-agent reinforcement learning and fast ADMM-based fine-tuning. This design allocates thousands of flows in parallel on a GPU while maintaining high TE quality, and it outperforms state-of-the-art acceleration schemes by delivering near-optimal allocations in seconds on large topologies. Teal supports multiple TE objectives and re-optimizes quickly after link failures or demand shifts, which makes it practical for dynamic cloud WAN environments. By enabling real-time or near-real-time TE decisions at the scale of global cloud WANs, the work has direct practical relevance, and its code is publicly available to support further research and deployment.
Abstract
The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
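To make "constraint violations such as overutilized links" concrete, the following is a minimal sketch of a path-based allocation on a toy three-link network. It is not Teal's code: the topology, the numbers, and the function names `link_loads` and `scale_to_feasible` are hypothetical, and the repair step is a naive proportional scale-down used as a stand-in, not ADMM.

```python
# Hypothetical toy instance; names and numbers are illustrative, not from the paper.
# Link capacities, keyed by link id.
capacity = {"A-B": 10.0, "B-C": 10.0, "A-C": 5.0}

# Each demand has a volume and a list of candidate paths (tuples of link ids).
demands = {
    "A->C": {"volume": 12.0, "paths": [("A-C",), ("A-B", "B-C")]},
    "A->B": {"volume": 6.0, "paths": [("A-B",)]},
}

# Split ratios: the fraction of each demand's volume placed on each candidate path.
# In a learned TE pipeline these would come from the model; here they are hand-picked.
splits = {"A->C": [0.5, 0.5], "A->B": [1.0]}


def link_loads(demands, splits):
    """Sum the traffic each link carries under the given split ratios."""
    loads = {link: 0.0 for link in capacity}
    for name, demand in demands.items():
        for ratio, path in zip(splits[name], demand["paths"]):
            for link in path:
                loads[link] += ratio * demand["volume"]
    return loads


def scale_to_feasible(demands, splits):
    """Naive repair (assumption, not ADMM): shrink every path flow by the
    scale factor of the most overloaded link on that path."""
    loads = link_loads(demands, splits)
    factor = {
        link: min(1.0, capacity[link] / loads[link]) if loads[link] > 0 else 1.0
        for link in capacity
    }
    return {
        name: [
            ratio * min(factor[link] for link in path)
            for ratio, path in zip(splits[name], demand["paths"])
        ]
        for name, demand in demands.items()
    }


loads = link_loads(demands, splits)
overloaded = sorted(link for link in capacity if loads[link] > capacity[link])
repaired = scale_to_feasible(demands, splits)
repaired_loads = link_loads(demands, repaired)
satisfied = sum(sum(repaired[n]) * d["volume"] for n, d in demands.items())
total = sum(d["volume"] for d in demands.values())
print("overloaded links:", overloaded)
print("satisfied fraction after repair:", round(satisfied / total, 3))
```

In this toy case the initial splits overload links `A-B` and `A-C`, and the scale-down restores feasibility at the cost of dropping some demand. A proportional cut like this is simple but can discard more traffic than necessary, which is the kind of loss an optimization-based fine-tuning step aims to reduce.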
