Leveraging InfiniBand Controller to Configure Deadlock-Free Routing Engines for Dragonflies
German Maglione-Mathey, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, Eitan Zahavi
TL;DR
This work tackles deadlock-free routing in InfiniBand-based Dragonfly networks by implementing Kim and Dally's minimal routing algorithm (DLA) as an OpenSM routing engine. It works around InfiniBand's VL and SL2VL constraints with an asymmetric SL2VL configuration, which lets DLA run in a real OpenSM environment for fully-connected Dragonflies. In extensive simulations and on the 42-node CELLIA cluster, DLA achieves competitive or superior throughput compared to other routing engines while using minimal resources (1 SL and 2 VLs), particularly when virtual output queuing (VOQ) is available. The study confirms that deploying DLA in OpenSM is practical, and it lays out the engine's architectural trade-offs and the scope for future extension beyond fully-connected Dragonflies.
Abstract
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed for Dragonflies, such as the algorithm by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm into OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experimental results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
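To make the Virtual-Channel shifting idea concrete, the following is a minimal sketch of how a per-switch SL2VL table could encode it with 1 SL and 2 VLs. It assumes minimal routes of the form local, global, local in a fully-connected Dragonfly, and it assumes the VL is chosen from the kind of port the packet arrived on: VL 1 once the packet has crossed a global link, VL 0 otherwise. The names `PortKind`, `vl_for_hop`, and `build_sl2vl` are hypothetical and are not OpenSM APIs, and the rule shown is an illustration, not necessarily the exact configuration used by the routing engine described in this paper.

```python
from __future__ import annotations

from enum import Enum


class PortKind(Enum):
    """Hypothetical classification of switch ports in a Dragonfly."""
    HOST = "host"      # port attached to an end node
    LOCAL = "local"    # intra-group link
    GLOBAL = "global"  # inter-group link


def vl_for_hop(in_kind: PortKind) -> int:
    """Assumed rule: shift to VL 1 after the global hop, stay on VL 0 before it."""
    return 1 if in_kind is PortKind.GLOBAL else 0


def build_sl2vl(port_kinds: dict[int, PortKind], sl: int = 0) -> dict[tuple[int, int, int], int]:
    """Build a table keyed by (input port, output port, SL) for one switch.

    The mapping depends on the input port, so the same SL yields different
    VLs on different port pairs; this is the sense in which the sketch is
    'asymmetric'.
    """
    table: dict[tuple[int, int, int], int] = {}
    for in_port, in_kind in port_kinds.items():
        for out_port in port_kinds:
            if out_port == in_port:
                continue  # a packet never leaves through its arrival port
            table[(in_port, out_port, sl)] = vl_for_hop(in_kind)
    return table


# Example switch: port 1 to a host, port 2 local, port 3 global.
ports = {1: PortKind.HOST, 2: PortKind.LOCAL, 3: PortKind.GLOBAL}
sl2vl = build_sl2vl(ports)
```

Under these assumptions, channel dependencies only ever go from VL 0 to VL 1 across the global hop, never back, which is why two VLs and a single SL are enough to break cycles on minimal routes.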
