Trace-based, time-resolved analysis of MPI application performance using standard metrics

Kingshuk Haldar

TL;DR

The paper tackles the challenge of extracting meaningful, time-local performance insights from large MPI trace files, where traditional aggregated metrics obscure transient bottlenecks. It introduces a post-mortem framework that discretises execution traces into time windows and reconstructs local MPI metrics (load balance, serialisation, transfer) using extended Lamport clocks to preserve MPI causality. Implemented as ClockTalk, the approach processes Paraver traces, handles clock inconsistencies and incorrect message matching, and interpolates at window boundaries to produce stable time-resolved metrics. Evaluations on synthetic benchmarks and real-world HPC codes (LaMEM, ls1-MarDyn) demonstrate that time-resolved metrics reveal localised bottlenecks and I/O effects that global aggregates miss, offering a lightweight, scalable alternative when full trace visualisation is impractical.

Abstract

Detailed trace analysis of MPI applications is essential for performance engineering, but growing trace sizes and complex communication behaviour often render comprehensive visual inspection impractical. This work presents a trace-based calculation of time-resolved values of standard MPI performance metrics (load balance, serialisation, and transfer efficiency) by discretising execution traces into fixed or adaptive time segments. The implementation processes Paraver traces post-mortem, reconstructing critical execution paths and handling common event anomalies, such as clock inconsistencies and unmatched MPI events, to robustly calculate metrics for each segment. The calculated per-window metric values expose transient performance bottlenecks that the time-aggregated metrics from existing tools may conceal. Evaluations on a synthetic benchmark and real-world applications (LaMEM and ls1-MarDyn) demonstrate how time-resolved metrics reveal localised performance bottlenecks obscured by global aggregates, offering a lightweight and scalable alternative even when trace visualisation is impractical.
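
To make the idea of a per-window metric concrete, the following is a minimal sketch, not the ClockTalk implementation. It assumes the trace has already been reduced to per-rank useful-compute intervals, uses fixed-size windows, clips each interval to the windows it overlaps, and applies the standard load-balance definition (average useful time divided by maximum useful time across ranks). The function name `windowed_load_balance` and the `(rank, start, stop)` input layout are hypothetical choices made for illustration only.

```python
import math


def windowed_load_balance(intervals, t_end, window, n_ranks):
    """Per-window load balance from (rank, start, stop) compute intervals.

    Illustrative assumption: an interval crossing a window boundary is
    simply clipped, so each window receives only the part inside it.
    """
    n_windows = math.ceil(t_end / window)
    # useful[w][r] accumulates compute time of rank r inside window w
    useful = [[0.0] * n_ranks for _ in range(n_windows)]
    for rank, start, stop in intervals:
        first = int(start // window)
        last = min(int(stop // window), n_windows - 1)
        for w in range(first, last + 1):
            lo = max(start, w * window)
            hi = min(stop, (w + 1) * window)
            if hi > lo:
                useful[w][rank] += hi - lo
    result = []
    for per_rank in useful:
        peak = max(per_rank)
        # Load balance = mean useful time / max useful time; undefined
        # (None) for a window in which no rank computed at all.
        result.append(sum(per_rank) / (n_ranks * peak) if peak > 0 else None)
    return result


# Two ranks, 20 time units: rank 0 computes throughout, rank 1 idles
# during the last quarter of the run.
example = [(0, 0, 20), (1, 0, 10), (1, 10, 15)]
per_window = windowed_load_balance(example, t_end=20, window=10, n_ranks=2)
whole_run = windowed_load_balance(example, t_end=20, window=20, n_ranks=2)
```

In this toy input the single whole-run value (0.875) hides the fact that the first window is perfectly balanced and the second is not (0.75), which is the kind of localised effect the per-window view is meant to expose.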

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Conceptual placement of the framework within existing MPI performance analysis workflows. It bridges the gap between exhaustive trace visualisation and globally aggregated scalar metrics by enabling time-resolved metric views. These time-series plots help analysts identify localised inefficiencies that would otherwise be hidden in global values.
  • Figure 2: Timeline discretisation overview and discretised metrics calculation framework.
  • Figure 3: The event points after critical path analysis resolve durations.
  • Figure 4: A compute followed by an MPI region with a discretisation boundary through it: $\Delta T_{oom}$ only contributes in the first window; $\Delta T_{critical}$ and $\Delta T_{elapsed}$ contribute to both windows as per the model.
  • Figure 5: 2-D stencil benchmark: Time-resolved metric plots with different window sizes. Smaller windows reveal transient bottlenecks (spikes) that are averaged out in longer windows, demonstrating the benefit of fine-grained temporal analysis. The dashed lines are the metrics for the entire execution duration.
  • ...and 3 more figures