MPI Implementation Profiling for Better Application Performance
Riley Shipley, Garrett Hooten, David Boehme, Derek Schafer, Anthony Skjellum, Olga Pearce
TL;DR
The paper tackles the challenge of MPI profiling by introducing two approaches: comparison-based profiling, which contrasts the performance of a baseline MPI implementation against an experimental one, and timeline profiling, which instruments MPI regions to produce interactive traces. The methods are demonstrated on ExaMPI using the COMB benchmark, with Caliper and Hatchet enabling data analysis. Results show that addressing core scheduling issues lifts ExaMPI's MPI-call performance to surpass a reference implementation ($3.58\times$ mean speedup) and reduces COMB runtime by $44.66\%$, while timeline profiling reveals and mitigates thread-contention bottlenecks. Together, the approaches provide practical guidelines for MPI implementers and users to diagnose, prioritize, and realize communication-performance optimizations, with broad applicability beyond the tested stack.
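The comparison-based approach described above can be sketched as a small calculation over two profiles. This is only an illustration, not the authors' Caliper/Hatchet workflow: the function names, the dictionary layout, and all timing values are hypothetical, and speedup is assumed here to mean baseline time divided by experimental time for each MPI call.

```python
from statistics import mean


def compare_profiles(baseline, experimental):
    """Per-call speedup (baseline time / experimental time).

    Both arguments map an MPI call name to its total time in seconds.
    Only calls present in both profiles are compared; times are assumed
    to be strictly positive.
    """
    shared = sorted(set(baseline) & set(experimental))
    return {name: baseline[name] / experimental[name] for name in shared}


def runtime_reduction_pct(before, after):
    """Percentage by which total runtime dropped from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical timings, chosen only to keep the arithmetic easy to follow.
baseline = {"MPI_Isend": 0.5, "MPI_Irecv": 0.75, "MPI_Waitall": 1.0}
experimental = {"MPI_Isend": 0.25, "MPI_Irecv": 0.25, "MPI_Waitall": 0.25}

speedups = compare_profiles(baseline, experimental)
mean_speedup = mean(speedups.values())
reduction = runtime_reduction_pct(before=8.0, after=6.0)

print(speedups)
print(f"mean speedup: {mean_speedup:.2f}x, runtime reduction: {reduction:.2f}%")
```

Sorting calls by speedup (lowest first) gives a simple way to see which MPI regions in the experimental implementation lag the baseline and should be investigated first.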
Abstract
While application profiling has been a mainstay in the HPC community for years, profiling of MPI and other communication middleware has not received the same degree of exploration. This paper adds to the discussion of MPI profiling, contributing two general-purpose profiling methods as well as practical applications of these methods to an existing implementation. The ability to detect performance defects in MPI codes using these methods increases the potential for further research and development in communication optimization.
