Study of Workload Interference with Intelligent Routing on Dragonfly

Yao Kang, Xin Wang, Zhiling Lan

TL;DR

This paper addresses workload interference on Dragonfly interconnects in exascale HPC by comparing reinforcement-learning-based Q-adaptive routing with conventional adaptive routing, using high-fidelity flit-level SST simulations of a 1,056-node Dragonfly. It introduces an enhanced SST toolkit, nine representative HPC/ML workloads, and two metrics for communication intensity, then conducts extensive pairwise and mixed-workload analyses. The results show that Q-adaptive routing can substantially reduce interference, improve tail-latency control, and increase system throughput by balancing network traffic and reducing hot spots, with communication time improving by up to about 42% in some cases. The work provides practical guidance for routing in shared Dragonfly networks and offers open-source tooling to enable further exascale research and optimization.

Abstract

The Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While sharing increases link utilization, the gain in workload performance is often offset by network contention. Recently, intelligent routing built on reinforcement learning has demonstrated higher network throughput with lower packet latency. However, its effectiveness in reducing workload interference is unknown. In this work, we present extensive network simulations to study multi-workload contention under two routing mechanisms, intelligent routing and adaptive routing, on a large-scale Dragonfly system. We develop an enhanced network simulation toolkit, along with a suite of workloads with distinctive communication patterns. We also present two metrics to characterize application communication intensity. Our analysis examines how different workloads interfere with each other under each routing mechanism by inspecting both application-level and network-level metrics. The analysis yields several key insights.

Paper Structure

This paper contains 16 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: The hierarchical structure of the 1,056-node Dragonfly system.
  • Figure 2: Q-adaptive routing and its two-level Q-table per router. Router X follows four steps to forward a packet and update its table. Through the process of sending packets and receiving feedback signals from the neighbors, every router learns the system-wide network condition and records the knowledge in its two-level Q-table for packet forwarding.
  • Figure 3: Enhancing SST for workload interference analysis. Our enhancements are shaded in green.
  • Figure 4: Average communication time (bar) and standard deviation (vertical line) of a target application over all processes. The results for six target applications are presented in (a)-(f). For a target application, each colored bar indicates its communication time under a background application (or none).
  • Figure 5: Network throughput of FFT3D and Halo3D over simulated time. Compared with PAR, Q-adaptive protects FFT3D's performance from Halo3D's interference, delivering 2.58x higher throughput (shown in green lines).
  • ...and 8 more figures
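
The feedback loop described in the Figure 2 caption (forward a packet, receive a feedback signal from the neighbor, update the table) can be illustrated with a minimal sketch. It uses the classic Q-routing update rule with a single flat table, not the paper's two-level Q-table or its four-step procedure, which this page does not spell out. The class name `QRouter`, its methods, the learning rate `alpha`, and the initial estimate are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class QRouter:
    """Hypothetical router keeping one delay estimate per (destination, neighbor).

    Assumption: a flat Q-table in the style of classic Q-routing, NOT the
    two-level Q-table used by Q-adaptive routing in the paper.
    """

    def __init__(self, name, neighbors, alpha=0.5, initial_estimate=1.0):
        self.name = name
        self.neighbors = list(neighbors)
        self.alpha = alpha  # learning rate (assumed value)
        # q[(dest, neighbor)] = estimated delivery time to dest via neighbor
        self.q = defaultdict(lambda: initial_estimate)

    def best_estimate(self, dest):
        """Feedback signal this router would send upstream for `dest`."""
        if dest == self.name:
            return 0.0  # packet has arrived, no remaining delay
        return min(self.q[(dest, n)] for n in self.neighbors)

    def choose_next_hop(self, dest):
        """Greedy choice: the neighbor with the lowest estimated delay."""
        return min(self.neighbors, key=lambda n: self.q[(dest, n)])

    def update(self, dest, neighbor, hop_delay, neighbor_estimate):
        """Move the old estimate toward (observed hop delay + neighbor's estimate)."""
        target = hop_delay + neighbor_estimate
        old = self.q[(dest, neighbor)]
        self.q[(dest, neighbor)] = old + self.alpha * (target - old)
        return self.q[(dest, neighbor)]
```

In this sketch, a router that observes a slow hop through one neighbor raises its estimate for that neighbor and then prefers an alternative on the next packet, which is the basic mechanism by which learned routing steers traffic away from congested links.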