Leveraging InfiniBand Controller to Configure Deadlock-Free Routing Engines for Dragonflies

German Maglione-Mathey, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, Eitan Zahavi

TL;DR

This work tackles deadlock-free routing in InfiniBand-based Dragonfly networks by implementing Kim and Dally's minimal routing algorithm (DLA) as an OpenSM routing engine. It works around InfiniBand's VL and SL2VL constraints with an asymmetric SL2VL configuration, which lets DLA run in a real OpenSM environment for fully-connected Dragonflies. In simulations and on the 42-node CELLIA cluster, DLA achieves competitive or superior throughput compared to other engines while using minimal resources (1 SL and 2 VLs), particularly when VOQ is available. The study confirms that deploying DLA in OpenSM is practical, and it lays out the architectural trade-offs and the scope for extending the approach beyond fully-connected Dragonflies.

Abstract

The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
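
To make the Virtual-Channel shifting idea concrete, the following is a minimal sketch, not the paper's OpenSM engine. It models a tiny fully-connected Dragonfly and assigns a VL to each hop of a minimal route. Everything here is an assumption made for illustration: the sizes `A` and `H`, the consecutive global-link arrangement in `gateway`, the helper names, and the rule in `sl2vl` that a packet moves from VL 0 to VL 1 when it enters a switch through a global port and leaves through a local port (one plausible reading of an asymmetric, port-dependent SL2VL table with 1 SL and 2 VLs).

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only to keep the example small.
A = 2          # switches per group
H = 1          # global links per switch
G = A * H + 1  # groups in a fully-connected Dragonfly

Switch = Tuple[int, int]     # (group id, switch index within the group)
Hop = Tuple[str, Switch, int]  # (link kind, next switch, VL used on that link)


def gateway(src_group: int, dst_group: int) -> int:
    """Index of the switch in src_group that owns the global link to dst_group.

    Assumes a consecutive arrangement of global links (illustrative only).
    """
    offset = (dst_group - src_group - 1) % G
    return offset // H


def sl2vl(in_kind: str, out_kind: str) -> int:
    """Assumed asymmetric mapping: shift to VL 1 only on global-in, local-out."""
    return 1 if (in_kind == "global" and out_kind == "local") else 0


def minimal_route(src: Switch, dst: Switch) -> List[Hop]:
    """Minimal route: at most local, global, local; VL chosen per hop."""
    hops: List[Hop] = []
    cur, in_kind = src, "host"  # the packet enters the first switch from a host port

    def step(out_kind: str, nxt: Switch) -> None:
        nonlocal cur, in_kind
        hops.append((out_kind, nxt, sl2vl(in_kind, out_kind)))
        cur, in_kind = nxt, out_kind

    if cur[0] != dst[0]:
        gw = gateway(cur[0], dst[0])
        if cur[1] != gw:
            step("local", (cur[0], gw))        # reach the gateway switch
        step("global", (dst[0], gateway(dst[0], src[0])))  # cross to the destination group
    if cur[1] != dst[1]:
        step("local", (dst[0], dst[1]))        # final local hop
    return hops
```

Under these assumptions, `minimal_route((0, 1), (1, 0))` yields a local hop on VL 0, a global hop on VL 0, and a final local hop on VL 1. The VL index never decreases along a path, which is the property that breaks cyclic channel dependencies in this kind of scheme.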

Paper Structure

This paper contains 13 sections, 4 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Generic Dragonfly connection pattern.
  • Figure 2: The ratio $\frac{f_g}{f_l}$ is close to $1$ when the network size increases.
  • Figure 3: Inter-group path in a Dragonfly topology.
  • Figure 4: Pseudo-code of the minimal routing for Dragonflies (DLA).
  • Figure 5: Diagram of the architecture of an IB-based $k$-port switch.
  • ...and 15 more figures