Analysing Mechanisms for Virtual Channel Management in Low-Diameter networks

Alejandro Cano, Cristóbal Camarero, Carmen Martínez, Ramón Beivide

TL;DR

This work addresses deadlocks in non-minimal Valiant routing on low-diameter networks (HyperX, Dragonfly, Dragonfly+) by evaluating how virtual-channel (VC) management policies affect throughput and stability. It systematically compares Two-Phase VC management (2Phases) and Ladder-based schemes (including Ladder with VC reuse) across topologies and adversarial traffic, using the CAMINOS simulator to reveal topology-specific performance and instability regimes. Ladder-based VC management, especially with VC reuse, provides the most stable and highest-throughput behavior in HyperX and Dragonfly networks, while Dragonfly+ benefits most from a 2Phases MinFirst configuration. Misconfigured VC policies can trigger severe head-of-line blocking (HoLB) and oscillations, degrading performance by as much as 90%. The results offer practical design guidelines for deadlock-free, high-throughput routing in data centers and HPC interconnects, and motivate future work integrating VC policies with in-transit adaptive routing and more dynamic VC allocation strategies.

Abstract

To interconnect their growing number of servers, current supercomputers and data centers are starting to adopt low-diameter networks such as HyperX, Dragonfly and Dragonfly+. These emerging topologies require balancing the load over their links, and finding suitable non-minimal routing mechanisms for them is particularly challenging. The Valiant load-balancing scheme is a very popular choice for non-minimal routing, and the more evolved adaptive routing mechanisms implemented in real systems are based on it. All these low-diameter networks are deadlock-prone when non-minimal routing is employed. Routing deadlocks occur when packets cannot progress due to cyclic dependencies. Developing efficient deadlock-free packet routing mechanisms is therefore critical for the progress of these networks. The routing function comprises the routing algorithm, which selects paths, and the buffer management policy, which dictates how packets allocate the buffers of the switches along their paths. For the same routing algorithm, different buffer management mechanisms can lead to very different performance. Moreover, certain mechanisms considered efficient for avoiding deadlocks may still suffer from hard-to-pinpoint instabilities that make the network response erratic. This paper explores the impact of these buffer management policies on the performance of current interconnection networks, showing a 90% performance drop when an incorrect buffer management policy is used. The study not only characterizes some of these undesirable scenarios but also proposes practical solutions.
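
The cyclic-dependency condition mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the unidirectional ring is a toy network rather than one of the studied topologies, the function names are hypothetical, and Ladder is modelled under the assumption that a packet on its h-th hop uses VC h. The sketch builds the channel dependency graph for both cases and checks it for cycles.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (u, v) pairs contains a cycle."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    # list(graph) snapshots the keys, since visit() may add sink nodes.
    return any(color[n] == WHITE and visit(n) for n in list(graph))


def ring_dependencies(n_switches, max_hops, ladder):
    """Channel dependencies on a toy unidirectional ring (illustrative only).

    A channel is a pair (link, vc), where link i goes from switch i to
    switch (i + 1) % n_switches. Packets may start at any switch and travel
    up to max_hops hops. With ladder=True the h-th hop uses VC h (an
    assumption of this sketch); otherwise every hop uses VC 0.
    """
    edges = set()
    for src in range(n_switches):
        for hop in range(max_hops - 1):
            link = (src + hop) % n_switches
            next_link = (link + 1) % n_switches
            vc, next_vc = (hop, hop + 1) if ladder else (0, 0)
            edges.add(((link, vc), (next_link, next_vc)))
    return edges


single_vc_deadlock_prone = has_cycle(ring_dependencies(4, 3, ladder=False))
ladder_deadlock_prone = has_cycle(ring_dependencies(4, 3, ladder=True))
```

With a single VC the dependencies wrap around the ring and form a cycle, so deadlock is possible. Under the hop-indexed assumption the VC index strictly increases along every dependency, so no cycle can form, at the cost of one VC per hop of the longest path.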

Paper Structure

This paper contains 12 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Small instances of the studied topologies: 2D HyperX, Dragonfly and Dragonfly+ respectively. Switches are represented as solid rectangles and servers are omitted. The links coming out of a selected group are in bold.
  • Figure 2: Available VCs (in gray) for a packet that has traveled a given number of hops in Ladder (left) and Ladder with reused VCs (right).
  • Figure 3: VCs allowed for injection depending on the VC management policy. Ladder injects at VC 0 (dark green); MinFirst injects in the first phase (VCs 0-1, dark green); MinLast injects most commonly in the first phase (VCs 0-1, dark green) and sporadically in the second phase (VCs 2-3, light green).
  • Figure 4: Crafted examples of accepted load for each of the three categories of stability.
  • Figure 5: Temporal performance of the MinLast mechanism on the 2D HyperX, Dragonfly and Dragonfly+ topologies, respectively. The offered load is 1.0. Each of the three charts plots ten simulation runs over time, one color per run, to expose the instabilities that can arise with this mechanism.
  • ...and 4 more figures