A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, Marcel Ferrari, Fabrizio Petrini, Torsten Hoefler

TL;DR

This paper reports the first real-world deployment of Slim Fly, a diameter-2 interconnect designed to reduce cost and power while maintaining high performance. It introduces a novel high-performance multipath routing scheme, based on the layered routing of FatPaths, that enables multiple disjoint, near-minimal paths without excessive deadlock-avoidance constraints, and demonstrates its effectiveness on a 200-node InfiniBand cluster. In an extensive evaluation against a 2-level Fat Tree across diverse workloads (microbenchmarks, scientific HPC tests, and distributed deep learning proxies), the Slim Fly system achieves performance comparable or superior to that of the Fat Tree while delivering strong scalability and substantial cost savings at large scales. The work further provides automated cabling scripts, correctness verification tooling, and a portable routing architecture, facilitating practical deployment of low-diameter interconnects beyond the SF installation presented.

Abstract

Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect.

Paper Structure (57 sections, 3 equations, 21 figures, 4 tables, 1 algorithm)

Figures (21)

  • Figure 1: First real-world deployment of the Slim Fly topology. The left-most rack displays labels detailing the arrangement of various components such as InfiniBand (IB) switches, compute nodes and Ethernet switches. Two types of IB links are present: black copper links for intra-rack connections and orange optical fiber links for inter-rack connections. The orange lines above the racks represent bundles of ten optical fiber links each. Additionally, blue, white and green (arbitrary color scheme) Ethernet cables are visible within the racks, which establish the cluster management network together with the Ethernet switches.
  • Figure 2: The structure of a small example Fat Tree (FT), Dragonfly (DF), and Slim Fly (SF), and the corresponding installations. Each topology comes with a modular design, where switches form groups (SF, DF) or pods (FT). Such groups can become racks in a physical installation.
  • Figure 3: Internal organization of a rack. The image displays a side-by-side comparison of a theoretical diagram and an actual photograph of a single rack in the cluster. The rack consists of two distinct subgroups, each housing 5 IB switches and 20 compute nodes (endpoints), for a total of 40 compute nodes per rack. Each IB switch is connected to 4 endpoints and 7 other IB switches.
  • Figure 4: Example diagrams created from the output of our scripts to facilitate the cabling process. The diagrams show all the inter-rack connections and the corresponding switch ports. Each switch is labeled using a triple $(S,R,I)$, where $S \in \{0,1\}$ indicates the subgroup type, $R \in \{0, \dots, 4\}$ indicates the rack, and $I \in \{0, \dots, 4\}$ is the consecutive switch ID within a rack/subgroup. Only ports 8--11 are shown; these ports are used to connect racks. Ports 1--4 (for endpoints) and 5--7 (for intra-rack switch-switch links) are omitted for clarity. The equations presented in the paper's Slim Fly construction section determine which switches are connected based on the assigned labels.
  • Figure 5: Layered routing in FatPaths and in this work. Traffic is divided and sent using different layers. Our scheme relaxes the FatPaths requirement that all layers be trees, because deadlock resolution is decoupled from layer creation. This gives more flexibility in constructing layers and leads to higher throughput. Specifically, while paths in different FatPaths layers often overlap (cf. Layer 1 and 2), our routing alleviates this issue, reducing overlap and congestion and increasing performance.
  • ...and 16 more figures
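
The switch labels and port counts in the captions of Figures 1, 3, and 4 can be cross-checked with a short script. The sketch below is not the authors' cabling tooling. It assumes the standard MMS-graph Slim Fly construction for q = 5 as described in the original Slim Fly proposal; the construction equations of this paper are not reproduced in this summary. The generator sets `X0` and `X1`, the choice to map the rack index R onto the construction's group coordinate, and all function names are illustrative assumptions.

```python
from collections import Counter, deque
from itertools import product

Q = 5          # assumed construction parameter: 2 * Q * Q = 50 switches
X0 = {1, 4}    # assumed generator set for subgroup type S = 0
X1 = {2, 3}    # assumed generator set for subgroup type S = 1


def build_slim_fly():
    """Return an adjacency map over switch labels (S, R, I)."""
    switches = list(product((0, 1), range(Q), range(Q)))
    adj = {sw: set() for sw in switches}
    # Links inside a subgroup: the difference of switch IDs lies in the
    # generator set. Both sets are closed under negation mod Q, so the
    # resulting adjacency is symmetric.
    for s, r, i in switches:
        for g in (X0 if s == 0 else X1):
            adj[(s, r, i)].add((s, r, (i + g) % Q))
    # Links between subgroup types: (0, x, y) -- (1, m, c) iff y = m*x + c.
    for x, y, m in product(range(Q), repeat=3):
        c = (y - m * x) % Q
        adj[(0, x, y)].add((1, m, c))
        adj[(1, m, c)].add((0, x, y))
    return adj


def eccentricity(adj, src):
    """Largest hop count from src to any switch, found by BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(adj):
        raise ValueError("topology is not connected")
    return max(dist.values())


adj = build_slim_fly()
degrees = {len(nbrs) for nbrs in adj.values()}
diameter = max(eccentricity(adj, sw) for sw in adj)
# Per switch: links that stay in the same rack (same R) vs. leave it.
intra_rack = {sum(v[1] == u[1] for v in adj[u]) for u in adj}
inter_rack = {sum(v[1] != u[1] for v in adj[u]) for u in adj}
# Number of cables between each unordered pair of racks.
cables = Counter(
    (min(u[1], v[1]), max(u[1], v[1]))
    for u in adj for v in adj[u]
    if u < v and u[1] != v[1]
)

print("switch-to-switch degree:", degrees)
print("diameter:", diameter)
print("intra-rack / inter-rack links per switch:", intra_rack, inter_rack)
print("cables per rack pair:", set(cables.values()))
```

Under these assumptions, the per-switch counts can be compared against the captions: 7 switch-to-switch links, of which 3 are intra-rack (ports 5--7) and 4 are inter-rack (ports 8--11), and inter-rack bundles of ten links between each pair of racks.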