Stable and Fault-Tolerant Decentralized Traffic Engineering

Arjun Devraj, Umesh Krishnaswamy, Ying Zhang, Karuna Grewal, Justin Hsu, Eva Tardos, Rachee Singh

TL;DR

This paper addresses divergence-induced congestion in decentralized TE by introducing Symphony, which regularizes TE objectives with a quadratic penalty so that each controller's problem has a unique solution that stays stable under demand perturbations. It couples this algorithmic stability with a randomized slicing algorithm that minimizes the blast radius of controller faults, preserving fault isolation. On production WANs, the approach yields a 14x reduction in divergence-induced congestion and a 79% reduction in blast radius, while maintaining throughput and, after optimizations, achieving runtimes comparable to LP-based TE. The work offers a practical, deployable path to robust decentralized TE with provable stability guarantees and improved resilience for planet-scale cloud WANs.

Abstract

Cloud providers have recently decentralized their wide-area network traffic engineering (TE) systems to contain the impact of TE controller failures. In the decentralized design, a controller fault only impacts its slice of the network, limiting the blast radius to a fraction of the network. However, we find that autonomous slice controllers can arrive at divergent traffic allocations that overload links by 30% beyond their capacity. We present Symphony, a decentralized TE system that addresses the challenge of divergence-induced congestion while preserving the fault-isolation benefits of decentralization. By augmenting TE objectives with quadratic regularization, Symphony makes traffic allocations robust to demand perturbations, ensuring TE controllers naturally converge to compatible allocations without coordination. In parallel, Symphony's randomized slicing algorithm partitions the network to minimize blast radius by distributing critical traffic sources across slices, preventing any single failure from becoming catastrophic. These innovations work in tandem: regularization ensures algorithmic stability to traffic allocations while intelligent slicing provides architectural resilience in the network. Through extensive evaluation on cloud provider WANs, we show Symphony reduces divergence-induced congestion by 14x and blast radius by 79% compared to current practice.
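
The stabilizing effect of quadratic regularization can be illustrated on a toy problem. The sketch below is not Symphony's formulation: it assumes a single demand split over two parallel paths, a max-throughput objective, and a penalty of the form eps * sum(x_i^2), all of which are illustrative choices (the function name `regularized_split` and the value of `eps` are hypothetical). Without the penalty, many splits achieve the same throughput, so two controllers can legitimately pick different ones; with it, the objective is strictly concave and the optimum is unique.

```python
def throughput(x):
    """Unregularized max-throughput objective: total flow admitted."""
    return sum(x)


def regularized_split(demand, caps, eps=1e-3, iters=60):
    """Maximize sum(x) - eps * sum(x_i^2)
    subject to sum(x) <= demand and 0 <= x_i <= caps[i].

    The KKT conditions give x_i = min(caps[i], level) for a common
    'water level'; we find the level by bisection.
    (Toy model, assumed for illustration only.)
    """
    level_max = 1.0 / (2.0 * eps)  # unconstrained maximizer of x - eps * x^2

    def total(level):
        return sum(min(c, level) for c in caps)

    if total(level_max) <= demand:  # demand constraint is not binding
        return [min(c, level_max) for c in caps]

    lo, hi = 0.0, level_max
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total(mid) <= demand:
            lo = mid
        else:
            hi = mid
    return [min(c, lo) for c in caps]


caps = [100.0, 100.0]

# Two different splits that are both optimal for the plain LP objective.
lp_optima = [[100.0, 50.0], [50.0, 100.0]]
print([throughput(x) for x in lp_optima])

# The regularized problem has one optimum, and it moves only slightly
# when the demand estimate is perturbed from 150 to 156 Gbps.
base = regularized_split(150.0, caps)
perturbed = regularized_split(156.0, caps)
print(base, perturbed)
```

In this toy setting, two controllers whose demand estimates differ by 6 Gbps end up with per-path allocations that differ by about 3 Gbps, rather than by the 50 Gbps gap that is possible between two equally optimal LP solutions.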

Paper Structure

This paper contains 33 sections, 3 theorems, 9 equations, 23 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

The Hessian of $g(\mathbf{x})$ is

Figures (23)

  • Figure 1: The first subfigure shows divergence in flow allocations between 2 pairs of slice controllers in a large commercial cloud WAN. The second subfigure shows an example of divergence in a network with 2 slices for demand $a\rightarrow e$. Due to differences in the demands predicted by the two controllers, slice 1's controller allocates 100 Gbps of traffic to the $abcde$ path and 50 Gbps to the $afge$ path, whereas slice 2's controller allocates 50 Gbps to $abcde$ and 100 Gbps to $afge$. Since each controller programs allocations only in its own slice, router $c$ receives 100 Gbps of traffic although its controller allocated only 50 Gbps along link $cd$, causing congestion on the starred links.
  • Figure 2: Architecture of decentralized TE systems.
  • Figure 3: Differences in demand inputs across 6 slice-controller pairs.
  • Figure 4: The percentage of flow that exceeds link capacities for the maximum throughput objective.
  • Figure 5: The percentage of flow that exceeds link capacities for the maximum concurrent flow objective.
  • ...and 18 more figures
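
The arithmetic behind the Figure 1 example can be checked with a few lines of Python. This is a minimal sketch that assumes, as the caption implies, that slice 1's controller programs the split at the ingress router $a$ while slice 2's controller programs router $c$; the helper name `excess_over_plan` is hypothetical, and link capacities are not modeled because the caption does not give them.

```python
# Per-path allocations (Gbps) for demand a -> e, taken from the Figure 1 caption.
slice1_plan = {"abcde": 100, "afge": 50}
slice2_plan = {"abcde": 50, "afge": 100}


def excess_over_plan(upstream, downstream):
    """Traffic arriving on each path beyond what the downstream
    controller planned for on that path."""
    return {
        path: max(0, upstream[path] - downstream.get(path, 0))
        for path in upstream
    }


# Slice 1 decides the split at a; slice 2 planned for a different split at c.
excess = excess_over_plan(slice1_plan, slice2_plan)
print(excess)  # {'abcde': 50, 'afge': 0}
```

The 50 Gbps of unplanned traffic on $abcde$ is what arrives at router $c$ beyond slice 2's allocation for link $cd$.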

Theorems & Definitions (7)

  • Proposition 1
  • Lemma 1
  • proof
  • Theorem 1
  • Definition 1
  • Definition 2
  • Definition 3