LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming

Siyuan Shen, Langwen Huang, Marcin Chrapek, Timo Schneider, Jai Dayal, Manisha Gajbe, Robert Wisniewski, Torsten Hoefler

TL;DR

LLAMP addresses the widely varying network latency tolerance of HPC MPI applications by combining the LogGPS communication model with linear programming to produce two metrics: latency sensitivity $\lambda_L$ and latency tolerance. It builds execution graphs from MPI traces, formulates a linear program that captures path costs under the LogGPS model, and uses reduced costs to extract $\lambda_L$ and other performance metrics. Relative prediction errors are typically below 2% across benchmarks such as MILC, LULESH, and LAMMPS, and a case study on ICON demonstrates its use for evaluating collective algorithms and network topologies. This analytical framework lets network architects and developers optimize HPC deployments without resorting to specialized hardware or slow simulators, and it offers actionable insights into latency-overlap opportunities and topology choices.
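To make the idea of latency sensitivity concrete, the following is a minimal sketch, not LLAMP's implementation. It replaces the linear program with a plain longest-path traversal over a toy execution graph, which is enough to show that the slope of the runtime with respect to $L$ equals the number of messages on the critical path. The function `critical_path`, the names `EDGES` and `ORDER`, and all edge costs are hypothetical values chosen for illustration.

```python
from collections import defaultdict

def critical_path(edges, order, L):
    """Longest path through a DAG whose edge cost is const + n_msgs * L.

    edges: list of (src, dst, const_cost, n_msgs) tuples (hypothetical format)
    order: vertices in topological order; the last one is the sink
    Returns (runtime, messages on the critical path). The second value is
    the slope dT/dL at this L, i.e. the latency sensitivity in this toy model.
    """
    out = defaultdict(list)
    for src, dst, const_cost, n_msgs in edges:
        out[src].append((dst, const_cost, n_msgs))

    # best[v] = (latest finish time at v, messages along that path)
    best = {v: (0.0, 0) for v in order}
    for u in order:
        t_u, m_u = best[u]
        for dst, const_cost, n_msgs in out[u]:
            cand = (t_u + const_cost + n_msgs * L, m_u + n_msgs)
            if cand[0] > best[dst][0]:
                best[dst] = cand
    return best[order[-1]]

# Toy graph (made-up costs in microseconds): one compute-only path of
# cost 100, and one path of cost 40 that also carries two messages.
EDGES = [
    ("start", "end", 100.0, 0),
    ("start", "a", 40.0, 0),
    ("a", "b", 0.0, 1),
    ("b", "end", 0.0, 1),
]
ORDER = ["start", "a", "b", "end"]

for L in (10.0, 50.0):
    runtime, sensitivity = critical_path(EDGES, ORDER, L)
    print(f"L={L}: runtime={runtime}, sensitivity={sensitivity}")
```

In this toy graph the runtime is $\max(100, 40 + 2L)$, so the sensitivity is 0 while the compute-only path dominates and 2 once the message-carrying path becomes critical. At the crossover point ($L = 30$) the slope is not unique, and this sketch simply reports the first path it finds.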

Abstract

The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. As large-scale MPI applications often exhibit significant differences in their network latency tolerance, it is crucial to accurately determine the extent of network latency an application can withstand without significant performance degradation. Current approaches to assessing this metric often rely on specialized hardware or network simulators, which can be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain that offers an efficient, analytical approach to evaluating HPC applications' network latency tolerance using the LogGPS model and linear programming. LLAMP equips software developers and network architects with essential insights for optimizing HPC infrastructures and strategically deploying applications to minimize latency impacts. Through our validation on a variety of MPI applications like MILC, LULESH, and LAMMPS, we demonstrate our tool's high accuracy, with relative prediction errors generally below 2%. Additionally, we include a case study of the ICON weather and climate model to illustrate LLAMP's broad applicability in evaluating collective algorithms and network topologies.

Paper Structure

This paper contains 42 sections, 8 equations, 20 figures, 2 tables, 3 algorithms.

Figures (20)

  • Figure 1: An example demonstrating varying degrees of network latency tolerance among traditional HPC applications, namely MILC, LULESH, and ICON. The green, orange, and red zones correspond to the maximum network latencies before observing a performance degradation of 1%, 2%, and 5%, respectively. The comparison between measured and predicted runtime showcases the predictive accuracy of our toolchain. The tolerance intervals are calculated directly by our tool.
  • Figure 2: High-level overview of the LLAMP toolchain.
  • Figure 3: An example illustrating the transformation of blocking p2p operations into an execution graph, assuming that the eager protocol is used. The figure lists the collected traces with only the start and end timestamps, shows the corresponding space-time diagram, and gives the resulting execution graph, in which calc vertices are marked in green while send and recv vertices are in red.
  • Figure 4: An example demonstrating that the network latency sensitivity of a program is determined by the number of messages along the critical path of the graph and is also dependent on the value of parameter $L$ itself.
  • Figure 5: Visualization of Equation \ref{eq:lp-model-example}. Dotted green lines define the linear constraints. Shaded areas mark the infeasible region. Blue lines highlight the borders of the feasible region.
  • ...and 15 more figures
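
The tolerance zones described for Figure 1 (the maximum latency before a 1%, 2%, or 5% slowdown) can be illustrated with a small worked example. This is a sketch under simplifying assumptions, not the tool's method: it assumes the runtime is the maximum over a handful of already enumerated paths, each with cost $a + bL$, and solves for the tolerance in closed form instead of using a linear program. The names `runtime`, `latency_tolerance`, and `PATHS`, as well as the numbers, are hypothetical.

```python
def runtime(paths, L):
    """Runtime as the longest of several paths, each costing a + b * L."""
    return max(a + b * L for a, b in paths)

def latency_tolerance(paths, base_L, degradation):
    """Largest L for which runtime(L) <= (1 + degradation) * runtime(base_L).

    paths: list of (a, b) pairs, where a is the latency-independent cost and
           b is the number of messages on that path (hypothetical format).
    Returns infinity when no path depends on L.
    """
    budget = (1.0 + degradation) * runtime(paths, base_L)
    # Each latency-dependent path stays within budget while L <= (budget - a) / b.
    limits = [(budget - a) / b for a, b in paths if b > 0]
    return min(limits) if limits else float("inf")

# Made-up example: a compute-only path of cost 100 and a path of cost 40
# that carries two messages; baseline latency of 5 (all in microseconds).
PATHS = [(100.0, 0), (40.0, 2)]

for pct in (0.01, 0.02, 0.05):
    tol = latency_tolerance(PATHS, base_L=5.0, degradation=pct)
    print(f"{pct:.0%} degradation tolerated up to L = {tol:.2f}")
```

With these numbers the baseline runtime is 100, so a 1% budget of 101 is reached when $40 + 2L = 101$, giving a tolerance of $L = 30.5$. The 2% and 5% budgets give 31 and 32.5 respectively.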