Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

Zhiying Xu; Francis Y. Yan; Rachee Singh; Justin T. Chiu; Alexander M. Rush; Minlan Yu

TL;DR

Teal addresses the scalability gap in WAN traffic engineering by combining a flow-centric graph neural network (FlowGNN), per-demand multi-agent reinforcement learning, and fast ADMM-based fine-tuning. Because each demand is allocated independently, thousands of flows can be processed in parallel on a GPU without sacrificing TE quality: on large topologies, Teal delivers near-optimal allocations in seconds and outperforms state-of-the-art acceleration schemes. The approach supports multiple TE objectives and rapid re-optimization in response to link failures or demand shifts, which makes it practical for dynamic cloud WAN environments. In practice, this enables real-time or near-real-time TE decisions on planet-scale WANs. The code is public to support further research and deployment.

Abstract

The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
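To make the terms "flow allocation", "overutilized links", and "satisfied demand" concrete, the sketch below evaluates a path-based allocation on a toy three-node network. It is not Teal's code and does not implement FlowGNN, multi-agent RL, or ADMM. The topology, the demand volumes, the split ratios, and the function names are all hypothetical, and the proportional-drop rule used to compute satisfied demand is an assumption chosen for simplicity.

```python
from typing import Dict, List, Tuple

Node = str
Link = Tuple[Node, Node]      # directed link (source, destination)
Demand = Tuple[Node, Node]    # traffic demand (ingress, egress)
Path = List[Node]             # a path is a sequence of nodes


def link_loads(volumes: Dict[Demand, float],
               paths: Dict[Demand, List[Path]],
               splits: Dict[Demand, List[float]]) -> Dict[Link, float]:
    """Sum the traffic placed on each link by the per-path split ratios."""
    loads: Dict[Link, float] = {}
    for demand, volume in volumes.items():
        for path, ratio in zip(paths[demand], splits[demand]):
            for link in zip(path, path[1:]):
                loads[link] = loads.get(link, 0.0) + volume * ratio
    return loads


def overutilization(loads: Dict[Link, float],
                    capacity: Dict[Link, float]) -> Dict[Link, float]:
    """Return the excess load on every link whose capacity is exceeded."""
    return {link: load - capacity[link]
            for link, load in loads.items() if load > capacity[link]}


def satisfied_demand(volumes: Dict[Demand, float],
                     paths: Dict[Demand, List[Path]],
                     splits: Dict[Demand, List[float]],
                     capacity: Dict[Link, float]) -> float:
    """Total delivered traffic under an assumed proportional-drop model:
    each path's flow is scaled by its most overloaded link."""
    loads = link_loads(volumes, paths, splits)

    def scale(link: Link) -> float:
        load = loads[link]
        return capacity[link] / load if load > capacity[link] else 1.0

    total = 0.0
    for demand, volume in volumes.items():
        for path, ratio in zip(paths[demand], splits[demand]):
            worst = min(scale(link) for link in zip(path, path[1:]))
            total += volume * ratio * worst
    return total


# Hypothetical toy instance: three nodes, three directed links.
capacity = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 5.0}
volumes = {("A", "C"): 12.0, ("A", "B"): 6.0}
paths = {("A", "C"): [["A", "C"], ["A", "B", "C"]], ("A", "B"): [["A", "B"]]}
splits = {("A", "C"): [0.5, 0.5], ("A", "B"): [1.0]}

loads = link_loads(volumes, paths, splits)
excess = overutilization(loads, capacity)
delivered = satisfied_demand(volumes, paths, splits, capacity)
print(loads, excess, delivered)
```

In this toy instance the split ratios overload links A→B and A→C, so part of the requested traffic is dropped. An allocation of this kind is what a fine-tuning step such as ADMM would adjust to reduce the violations.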

Paper Structure (27 sections, 10 equations, 18 figures, 2 tables)

Introduction
Background and Motivation
Scaling challenges of TE
Accelerate TE optimization with ML
Challenges of applying ML to TE
Teal: Learning-accelerated TE
Overview
Feature learning with FlowGNN
Flow allocation with multi-agent RL
Solution fine-tuning with ADMM
Implementation of Teal
Evaluation
Methodology
Teal vs. the state of the art
Reacting to link failures
...and 12 more sections

Figures (18)

Figure 1: Control loop of WAN traffic engineering.
Figure 2: On a topology with >1,700 nodes (ASN in the paper's topology table), the TE optimization using the Gurobi solver experiences only a marginal speedup as more CPU threads become available.
Figure 3: Workflow of Teal. Teal inputs traffic demands and link capacities into FlowGNN to learn flow embeddings (see "Feature learning with FlowGNN"), which are then mapped to initial traffic allocations through multi-agent RL (see "Flow allocation with multi-agent RL"). ADMM subsequently fine-tunes the allocations and mitigates constraint violations (see "Solution fine-tuning with ADMM").
Figure 4: Illustration of a FlowGNN construction. FlowGNN alternates between GNN layers that are designed to capture capacity constraints, and DNN layers that are intended to capture demand constraints.
Figure 5: Teal processes each demand independently using a shared, significantly smaller policy network.
...and 13 more figures
