MPI Implementation Profiling for Better Application Performance

Riley Shipley, Garrett Hooten, David Boehme, Derek Schafer, Anthony Skjellum, Olga Pearce

TL;DR

The paper tackles the challenge of MPI profiling by introducing two approaches: comparison-based profiling, which contrasts the performance of a baseline MPI implementation against an experimental one, and timeline profiling, which instruments MPI regions to produce interactive traces. The methods are demonstrated on ExaMPI using the COMB benchmark, with Caliper and Hatchet enabling data analysis. Results show that addressing core scheduling issues lifts ExaMPI's MPI-call performance to surpass a reference implementation ($3.58\times$ mean speedup) and reduces COMB runtime by $44.66\%$, while timeline profiling reveals and mitigates thread-contention bottlenecks. Together, the approaches provide practical guidelines for MPI implementers and users to diagnose, prioritize, and realize communication-performance optimizations, with broad applicability beyond the tested stack.

Abstract

While application profiling has been a mainstay in the HPC community for years, profiling of MPI and other communication middleware has not received the same degree of exploration. This paper adds to the discussion of MPI profiling, contributing two general-purpose profiling methods as well as practical applications of these methods to an existing implementation. The ability to detect performance defects in MPI codes using these methods increases the potential of further research and development in communication optimization.
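
The comparison-based profiling idea described above can be illustrated with a minimal sketch: take per-region mean times from a baseline MPI implementation and an experimental one, then compute a speedup for each region that appears in both. This is not the authors' Caliper/Hatchet workflow. The function names `compare_profiles` and `percent_reduction` are hypothetical, and all timings are invented placeholders rather than results from the paper.

```python
from statistics import mean


def compare_profiles(baseline, experimental):
    """Per-region speedup of `experimental` over `baseline`.

    Both arguments map a region name to its mean completion time.
    Speedup is baseline_time / experimental_time, so values above 1.0
    mean the experimental implementation is faster in that region.
    Regions missing from either profile are skipped.
    """
    shared = baseline.keys() & experimental.keys()
    return {
        region: baseline[region] / experimental[region]
        for region in sorted(shared)
        if experimental[region] > 0
    }


def percent_reduction(before, after):
    """Relative runtime reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Placeholder timings in seconds (assumed values, not measurements).
baseline = {"MPI_Isend": 4.0, "MPI_Irecv": 3.0, "MPI_Waitall": 12.0}
experimental = {"MPI_Isend": 2.0, "MPI_Irecv": 3.0, "MPI_Waitall": 3.0}

speedups = compare_profiles(baseline, experimental)
mean_speedup = mean(speedups.values())

# Rank regions by speedup, lowest first, to see where the experimental
# implementation gains least and deserves attention.
for region, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{region:12s} {s:.2f}x")
print(f"mean speedup: {mean_speedup:.2f}x")
```

Sorting regions by speedup is one simple way to prioritize optimization work. In a real study the two dictionaries would come from profiles collected on the same benchmark under both implementations.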

Paper Structure

This paper contains 15 sections and 11 figures.

Figures (11)

  • Figure 1: Top of a Hatchet tree containing average completion times of ExaMPI at the start of the experiment
  • Figure 2: Portion of a Hatchet tree showing ExaMPI's lower performance relative to Spectrum on both communication and computation tasks
  • Figure 3: Portion of a Hatchet tree showing the improved ExaMPI's performance relative to Spectrum
  • Figure 4: Comparison of ExaMPI, before and after the core scheduling changes, against Spectrum
  • Figure 5: Comparison of COMB completion times between all 3 implementations
  • ...and 6 more figures