Spritz: Path-Aware Load Balancing in Low-Diameter Networks

Tommaso Bonato, Ales Kubicek, Abdul Kabbani, Ahmad Ghalayini, Maciej Besta, Torsten Hoefler

TL;DR

Spritz is a flexible sender-based load balancing framework that shifts adaptive, topology-aware routing to the endpoints using only standard Ethernet features, offering unified routing and load balancing for the Ultra Ethernet era.

Abstract

Low-diameter topologies such as Dragonfly and Slim Fly are increasingly adopted in HPC and datacenter networks, yet existing load balancing techniques either rely on proprietary in-network mechanisms or fail to utilize the full path diversity of these topologies. We introduce Spritz, a flexible sender-based load balancing framework that shifts adaptive topology-aware routing to the endpoints using only standard Ethernet features. We propose two algorithms, Spritz-Scout and Spritz-Spray, which respectively explore and adaptively cache efficient paths using ECN, packet trimming, and timeout feedback. Through simulation on Dragonfly and Slim Fly topologies with over 1000 endpoints, Spritz outperforms ECMP, UGAL-L, and prior sender-based approaches by up to 1.8x in flow completion time under AI training and datacenter workloads, while offering robust failover with performance improvements of up to 25.4x under link failures, all without additional hardware support. Spritz enables datacenter-scale, commodity Ethernet networks to efficiently leverage low-diameter topologies, offering unified routing and load balancing for the Ultra Ethernet era.

Paper Structure (49 sections, 1 equation, 9 figures, 3 tables, 3 algorithms)

Figures (9)

  • Figure 1: Overview of the Spritz framework.
  • Figure 2: Decision logic on each switch (left). Source-Guided Adaptive Routing in Dragonfly and Slim Fly (right). The routing process from a source endpoint to a destination endpoint involves at most three steps: (1) control of the first hop at the ECMP 1 switch using EV1; (2) control of the second hop at the ECMP 2 switch using EV2 (this switch is always a direct neighbour of the ECMP 1 switch); (3) reaching an intermediate location, from which the default forwarding table is used to reach the destination. The source uses EV1 and EV2 to guide the routing decisions. Based on these values, the path can be either minimal or non-minimal, with varying length.
  • Figure 3: Upper bound on the memory required per endpoint to store the endpoint table. For each destination switch, the source endpoint stores an EV entry list, enumerating bounded simple paths to reach that destination (sorted by latency). We show how the requirements scale for Dragonfly and Slim Fly topologies up to 40k endpoints when enumerating all bounded simple paths. Paths refers to the maximum number of paths for any source-destination pair.
  • Figure 4: Main components of the Spritz framework. Load a particular EV entry list from the endpoint table when a particular connection is established. Send logic to control the path preference in the network. Feedback logic to gather information about good paths and bad paths.
  • Figure 5: Illustration of a motivational example showing fast adaptation to non-congested groups (paths) in a Dragonfly topology. We observe the flow completion time (FCT) of a single monitored 4 MiB flow between a specific source and destination endpoint. Most groups are heavily congested: many flows target a switch with a global link to the destination group, creating significant queue buildup. A few groups do not carry this traffic and are therefore uncongested (free groups). The goal of the load balancing scheme is to quickly discover and use these target paths.
  • ...and 4 more figures
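
The captions of Figures 2 to 4 describe a sender that keeps, per destination, a latency-sorted list of EV entries and uses network feedback to prefer good paths over bad ones. The following is a minimal Python sketch of that general idea only; it is not Spritz-Scout or Spritz-Spray as defined in the paper. The names `EVEntry`, `PathSelector`, `Feedback`, and `PENALTY`, the penalty weights, and the integer EV encoding are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Feedback(Enum):
    ACK = 0      # delivered without a congestion mark
    ECN = 1      # delivered with an ECN mark
    TRIM = 2     # payload trimmed by a congested switch
    TIMEOUT = 3  # nothing came back in time


# Hypothetical penalty weights; harsher signals push a path further back.
PENALTY = {Feedback.ECN: 1, Feedback.TRIM: 2, Feedback.TIMEOUT: 4}


@dataclass(frozen=True)
class EVEntry:
    ev1: int         # steers the first hop (encoding is an assumption)
    ev2: int         # steers the second hop (encoding is an assumption)
    latency_ns: int  # path latency estimate used to order the list


@dataclass
class PathSelector:
    entries: list[EVEntry]
    penalty: dict[EVEntry, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Keep the EV entry list sorted by latency, as in the Figure 3 caption.
        self.entries = sorted(self.entries, key=lambda e: e.latency_ns)

    def pick(self) -> EVEntry:
        # Lowest penalty wins; min() returns the first minimum, so ties
        # fall back to the lowest-latency entry.
        return min(self.entries, key=lambda e: self.penalty.get(e, 0))

    def report(self, entry: EVEntry, fb: Feedback) -> None:
        current = self.penalty.get(entry, 0)
        if fb is Feedback.ACK:
            self.penalty[entry] = max(0, current - 1)  # slowly forgive
        else:
            self.penalty[entry] = current + PENALTY[fb]


# Usage: a timeout on the fastest path moves traffic to the next-best entry.
selector = PathSelector([EVEntry(0, 0, 300), EVEntry(1, 0, 100), EVEntry(2, 3, 200)])
first = selector.pick()
selector.report(first, Feedback.TIMEOUT)
second = selector.pick()
```

A single additive penalty per entry keeps the sketch short; a real sender would likely also age out stale state and spread load across several entries rather than always picking one.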