Union: An Automatic Workload Manager for Accelerating Network Simulation

Xin Wang, Misbah Mubarak, Yao Kang, Robert B. Ross, Zhiling Lan

TL;DR

The paper addresses the challenge of evaluating co-running HPC and ML workloads on large dragonfly interconnects. It introduces Union, an automatic workload manager that translates coNCePTuaL descriptions into in situ skeletons for CODES and coordinates their execution. Through large-scale simulations on 1D and 2D dragonfly networks, the study reveals that message latency is a robust indicator of network interference, while ML performance is more tied to a node's communication time; grouping strategies and adaptive routing mitigate interference. The work provides practical guidance for schedulers and runtimes and releases Union as open-source to enable further hybrid workload studies.

Abstract

With the rapid growth of machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a valuable research vehicle for understanding the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experimental results show that both message latency and communication time are important performance metrics for evaluating network interference. Network interference on HPC applications is reflected more by the variation in message latency, whereas ML application performance depends more on the communication time.

Paper Structure (22 sections, 9 figures, 6 tables)

Figures (9)

  • Figure 1: A sample coNCePTuaL code for Ping-Pong test.
  • Figure 2: Overview of CODES
  • Figure 3: Diagram of in situ simulation framework with Union
  • Figure 4: Structure that defines a Union skeleton object.
  • Figure 5: An example code snippet of a Union skeleton generated from the Ping-Pong test in Figure 1. Line 23 handles parsing of the command line (lines 5-8 of Figure 1) and initialization of event queues (lines 11-17 of Figure 1). Line 24 processes all the events of the ping-pong application shown in Figure 1. Lines 6-13 intercept and translate the communication operations (lines 13 and 14 of Figure 1) to Union message passing interfaces. Some portions of the code are skipped due to space limits.
  • ...and 4 more figures