Modeling and Analysis of Application Interference on Dragonfly+

Yao Kang, Xin Wang, Neil McGlohon, Misbah Mubarak, Sudheer Chunduri, Zhiling Lan

TL;DR

This study quantitatively evaluates a variety of application communication interactions on a 3,456-node Dragonfly+ system by using the CODES toolkit and shows that intra-job interference can cause severe performance degradation for communication-intensive applications.

Abstract

The Dragonfly class of networks is considered a promising interconnect for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance variability due to their hierarchical architecture and resource-sharing design. Event-driven network simulators are indispensable tools for navigating complex system design. In this study, we quantitatively evaluate a variety of application communication interactions on a 3,456-node Dragonfly+ system by using the CODES toolkit. This study looks at the impact of communication interference from a user's perspective. Specifically, for a given application submitted by a user, we examine how this application will behave alongside the existing workload running in the system under different job placement policies. Our simulation study considers hundreds of experiment configurations, including four target applications with representative communication patterns under a variety of network traffic conditions. Our study shows that intra-job interference can cause severe performance degradation for communication-intensive applications. Inter-job interference can generally be reduced for applications with one-to-one or one-to-many communication patterns through job isolation. Applications with a one-to-all communication pattern are resilient to network interference.

Paper Structure

This paper contains 21 sections, 4 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: A 3,456-node Dragonfly+ system
  • Figure 2: Global link connection between group $i$ and group $j$, where $0\leq i < j \leq 8$.
  • Figure 3: Fully Progressive Adaptive Routing Paths
  • Figure 4: (a) Background traffic generated among three groups. (b) Group 0 with highlighted nodes holding background application processes
  • Figure 5: Message latency of the target application with the Uniform Random (UR) communication pattern. The top row shows contiguous job placement, where the target application does not share any group with the background application. The bottom row shows random placement, where the target application shares groups with the background application. Color identifies the background application load; the baseline is the target application running alone on the system.
  • ...and 6 more figures
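
The captions above contrast contiguous and random job placement. The following is a minimal sketch of that difference and not the paper's CODES configuration. It assumes 9 groups (inferred from the index range $0\leq i < j \leq 8$ in the Figure 2 caption), equal-sized groups, nodes numbered consecutively within a group, and a hypothetical 768-node job; none of these details are confirmed by the text shown here.

```python
import random

NUM_NODES = 3456   # system size stated in the paper
NUM_GROUPS = 9     # assumption: inferred from 0 <= i < j <= 8 in the Figure 2 caption
NODES_PER_GROUP = NUM_NODES // NUM_GROUPS  # 384 under the equal-size assumption


def group_of(node):
    """Group index of a node, assuming consecutive numbering within groups."""
    return node // NODES_PER_GROUP


def contiguous_placement(job_size, first_node=0):
    """Assign a job to consecutive node IDs starting at first_node."""
    return list(range(first_node, first_node + job_size))


def random_placement(job_size, seed=0):
    """Assign a job to node IDs drawn uniformly at random without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(NUM_NODES), job_size))


def groups_used(nodes):
    """Sorted list of the groups that a placement touches."""
    return sorted({group_of(n) for n in nodes})


JOB_SIZE = 768  # hypothetical job size, chosen only for illustration
contig = contiguous_placement(JOB_SIZE)
rand = random_placement(JOB_SIZE)

print("contiguous placement uses groups:", groups_used(contig))
print("random placement uses", len(groups_used(rand)), "groups")
```

Under these assumptions the contiguous job stays inside two groups, while the random job is scattered across many more, which is why it is more likely to share groups with a background application.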