Automated Programmatic Performance Analysis of Parallel Programs

Onur Cankur; Aditya Tomar; Daniel Nichols; Connor Scully-Allison; Katherine E. Isaacs; Abhinav Bhatele

TL;DR

The paper addresses the difficulty of analyzing large-scale parallel performance data by introducing Chopper, a Python API built on Hatchet that provides high-level, configurable analyses for both single and multiple executions. Chopper offers programmatic methods to detect load imbalance, hot paths, and scalability bottlenecks, to correlate metrics with calling context tree (CCT) nodes, and to analyze performance variability, all through a single interface that works with existing Python visualization libraries. The authors demonstrate reading multiple profiles, unifying GraphFrames, and computing per-CCT-node efficiency and speedup across five HPC applications (AMG, Laghos, LULESH, Quicksilver, Tortuga). Case studies and API performance tests indicate that Chopper reduces developer effort and supports scalable, reproducible analysis workflows. Proposed future work includes predictive modeling for correlation, customizable plotting, and GPU performance analysis.

Abstract

Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.
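To make the load imbalance task mentioned above concrete, the following minimal sketch computes a per-node imbalance value as the ratio of the maximum to the mean time across processes, which is one common definition. It is an illustration in plain Python, not Chopper's implementation: the function name `imbalance_sketch`, the sample timings, and the choice of max/mean are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical per-process timings (seconds) for three CCT nodes on four
# processes. These numbers are invented for illustration only.
times_per_node = {
    "main": [10.0, 10.0, 10.0, 10.0],
    "solve": [4.0, 6.0, 5.0, 5.0],
    "exchange": [1.0, 1.0, 1.0, 5.0],
}


def imbalance_sketch(per_process_times):
    """Return (node, max/mean) pairs sorted from most to least imbalanced.

    A ratio of 1.0 means every process spent the same time in the node;
    larger values mean one process is doing disproportionately more work.
    """
    ratios = {
        node: max(values) / mean(values)
        for node, values in per_process_times.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranking = imbalance_sketch(times_per_node)
for node, ratio in ranking:
    print(f"{node}: {ratio:.2f}")
```

Sorting by the ratio places the nodes most worth investigating first, which mirrors the idea of ordering results by an imbalance column.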

Paper Structure (18 sections, 3 equations, 13 figures, 1 table, 1 algorithm)

Introduction
Background and Related Work
Profiling and Call Graphs
Hatchet
Common Performance Analysis Problems
Related Work
Performance Analysis Tools
Simplifying Performance Analysis Tasks
Chopper: A Python API for Performance Analysis
Analyzing a Single Execution
Comparing Multiple Executions
Experimental Setup
Performance Evaluation of Chopper
API Performance for Single Executions
Case Studies
...and 3 more sections

Figures (13)

Figure 1: Creating a call graph from a CCT using the to_callgraph function. Hatchet's Jupyter notebook visualization is used to visualize the CCT (a). The call graph (b) is visualized externally.
Figure 2: Calculating the load imbalance of a 512-process execution of LULESH by using the load_imbalance function. The resulting DataFrame is sorted by the time.imbalance column, which shows the imbalance value for each CCT node.
Figure 3: Identifying the hot path of a simple CCT using the hot_path function in Chopper. The red-colored path with bigger, labeled nodes represents the hot path.
Figure 4: The multirun_analysis function returns a pivot table containing node names and time values of the nodes in each profile. We show a truncated example of the returned pivot table from a set of LULESH weak scaling executions (64, 125, 216, and 512 processes).
Figure 5: GraphFrames before and after unification by the unify_multiple_graphframes function. The resulting GraphFrames include all nodes from the given GraphFrames but retain their original metric values.
...and 8 more figures
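The hot path idea in Figure 3 can be sketched in plain Python as follows. The sketch starts at the root and repeatedly follows the child with the largest inclusive time, stopping when that child accounts for less than a chosen fraction of its parent's time. This is one common way to define a hot path and is not taken from Chopper's source: the function name `hot_path_sketch`, the `threshold` parameter and its default of 0.5, the dictionary layout, and the timings are assumptions made for this example.

```python
# Hypothetical CCT: each node has a name, an inclusive time (seconds),
# and a list of children. All values are invented for illustration.
cct = {
    "name": "main", "time": 100.0, "children": [
        {"name": "init", "time": 10.0, "children": []},
        {"name": "solve", "time": 85.0, "children": [
            {"name": "matvec", "time": 60.0, "children": [
                {"name": "pack", "time": 20.0, "children": []},
                {"name": "compute", "time": 25.0, "children": []},
            ]},
            {"name": "reduce", "time": 20.0, "children": []},
        ]},
    ],
}


def hot_path_sketch(root, threshold=0.5):
    """Return node names along the hot path, starting at the root.

    At each step the most expensive child is followed, but only while it
    accounts for at least `threshold` of its parent's inclusive time.
    """
    path = [root["name"]]
    node = root
    while node["children"]:
        hottest = max(node["children"], key=lambda child: child["time"])
        if hottest["time"] < threshold * node["time"]:
            break  # No single child dominates; the hot path ends here.
        path.append(hottest["name"])
        node = hottest
    return path


path = hot_path_sketch(cct)
print(" -> ".join(path))
```

Lowering the threshold extends the path deeper into the tree, while raising it stops earlier at the node where time starts to spread across several children.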
