MPI Implementation Profiling for Better Application Performance

Riley Shipley; Garrett Hooten; David Boehme; Derek Schafer; Anthony Skjellum; Olga Pearce

TL;DR

The paper tackles the challenge of MPI profiling by introducing two approaches: comparison-based profiling, which contrasts the performance of a baseline MPI implementation against an experimental one, and timeline profiling, which instruments MPI regions to produce interactive traces. The methods are demonstrated on ExaMPI using the COMB benchmark, with Caliper and Hatchet enabling data analysis. Results show that addressing core scheduling issues lifts ExaMPI's MPI-call performance to surpass a reference implementation ($3.58\times$ mean speedup) and reduces COMB runtime by $44.66\%$, while timeline profiling reveals and mitigates thread-contention bottlenecks. Together, the approaches provide practical guidelines for MPI implementers and users to diagnose, prioritize, and realize communication-performance optimizations, with broad applicability beyond the tested stack.
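The arithmetic behind a comparison-based summary can be sketched as follows. This is a minimal illustration, not the paper's Caliper/Hatchet workflow: the function names `compare_profiles` and `percent_reduction`, the region names, and all timings are hypothetical, and each profile is assumed to be a plain mapping from MPI region name to mean completion time in seconds.

```python
from statistics import mean


def compare_profiles(baseline, experimental):
    """Per-region speedup of the experimental run over the baseline.

    A value above 1.0 means the experimental implementation was faster
    in that region. Only regions present in both profiles are compared.
    """
    shared = baseline.keys() & experimental.keys()
    return {
        region: baseline[region] / experimental[region]
        for region in sorted(shared)
    }


def percent_reduction(before, after):
    """Relative runtime reduction, as a percentage of the original runtime."""
    return 100.0 * (before - after) / before


# Hypothetical mean completion times (seconds) per MPI region.
baseline = {"MPI_Isend": 4.0, "MPI_Irecv": 3.0, "MPI_Waitall": 10.0}
experimental = {"MPI_Isend": 2.0, "MPI_Irecv": 3.0, "MPI_Waitall": 2.5}

speedups = compare_profiles(baseline, experimental)
mean_speedup = mean(speedups.values())

for region, factor in speedups.items():
    print(f"{region}: {factor:.2f}x")
print(f"mean speedup: {mean_speedup:.2f}x")
print(f"runtime reduction: {percent_reduction(10.0, 7.5):.2f}%")
```

Reporting both a per-region ratio and an end-to-end reduction mirrors the two kinds of figures quoted above: a mean speedup across MPI calls and a percentage drop in total benchmark runtime.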

Abstract

While application profiling has been a mainstay in the HPC community for years, profiling of MPI and other communication middleware has not received the same degree of exploration. This paper adds to the discussion of MPI profiling, contributing two general-purpose profiling methods as well as practical applications of these methods to an existing implementation. The ability to detect performance defects in MPI codes using these methods increases the potential of further research and development in communication optimization.

Paper Structure (15 sections, 11 figures)

This paper contains 15 sections, 11 figures.

Introduction
Libraries & Applications
  ExaMPI
  Caliper
  COMB
Comparison-based Profiling
  Methodology
  Experimentation
  Results
Timeline Profiling
  Methodology
  Experimentation
  Results
Conclusions
Future Work

Figures (11)

Figure 1: Top of a Hatchet tree containing average completion times of ExaMPI at the start of the experiment
Figure 2: Portion of a Hatchet tree showing ExaMPI's lower performance relative to Spectrum on both communication and computation tasks
Figure 3: Portion of a Hatchet tree showing the improved ExaMPI's performance relative to Spectrum
Figure 4: Comparison of ExaMPI before and after core scheduling changes to Spectrum
Figure 5: Comparison of COMB completion times between all 3 implementations
...and 6 more figures
