Optimizing Performance on Trinity Utilizing Machine Learning, Proxy Applications and Scheduling Priorities

Phil Romero

TL;DR

Problem: large-scale clusters such as Trinity are increasingly bottlenecked by slow nodes. Approach: combine fast proxy applications (MPI/OpenMP) with ML and multivariate outlier detection to identify underperforming nodes efficiently. Key contributions: a proxy-app suite that mimics HPL performance, a regression mapping to HPL, Mahalanobis-distance-based outlier detection, and exploration of neural-network regression and boosting. Findings: simple regression flagged about 12 of 33 outliers, Mahalanobis distance identified around 20, and neural nets provide an outlier probability, while Random Forests struggled due to extreme class imbalance. Significance: provides a practical, data-driven framework to guide scheduling and maintenance to maximize cluster efficiency and minimize downtime.

Abstract

The number of nodes in today's supercomputers continues to increase; the first half of Trinity alone contains more than 9400 compute nodes. Since the speed of today's clusters is limited by the slowest nodes, it is more important than ever to identify slow nodes, improve their performance where possible, and ensure minimal use of slower nodes during performance-critical runs. This is an ongoing maintenance task that occurs on a regular basis; it is therefore important to minimize the impact on users by assessing and addressing slow-performing nodes and mitigating their consequences while minimizing downtime. These issues can be solved, in large part, through the systematic application of fast-running hardware assessment tests, the application of machine learning, and the use of performance data to increase the efficiency of large clusters. Proxy applications utilizing both MPI and OpenMP were developed to produce data as a substitute for long-running applications when evaluating node performance. Machine learning is applied to identify underperforming nodes, and policies are being discussed to both minimize the impact of underperforming nodes and increase the efficiency of the system. In this paper, I describe the process used to produce fast-running proxy tests, consider various methods to isolate the outliers, and produce ordered lists for use in scheduling.
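The multivariate outlier step named in the TL;DR can be illustrated with a minimal sketch. This is not the author's implementation. The data is synthetic, the three feature columns stand in for proxy-test scores, the slow-node indices are invented, and the cutoff of 16.27 (roughly the 99.9% chi-square quantile for three degrees of freedom) is an assumption chosen for this example.

```python
import numpy as np


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    diff = X - mu
    # Row-wise quadratic form: diff_i^T * inv_cov * diff_i
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


def rank_nodes(X, threshold):
    """Return distances, node indices ordered most-anomalous first, and flagged indices."""
    d2 = mahalanobis_sq(X)
    order = np.argsort(d2)[::-1]
    flagged = np.flatnonzero(d2 > threshold)
    return d2, order, flagged


# Synthetic stand-in for per-node proxy-test scores (hypothetical values):
# columns could represent, e.g., DGEMM, NBODY, and memory-write results.
rng = np.random.default_rng(0)
n_nodes = 500
X = rng.normal(loc=[100.0, 50.0, 80.0], scale=[2.0, 1.0, 1.5], size=(n_nodes, 3))

# Hypothetical slow nodes: degrade all three scores for a few node indices.
slow = [7, 42, 99]
X[slow] -= np.array([15.0, 8.0, 12.0])

# Assumed cutoff: about the 99.9% chi-square quantile for 3 degrees of freedom.
THRESHOLD = 16.27
d2, order, flagged = rank_nodes(X, THRESHOLD)

print("most anomalous nodes:", order[:5].tolist())
print("flagged nodes:", flagged.tolist())
```

The `order` array is one possible form of the ordered list mentioned in the abstract: a scheduler could hand out nodes from the end of that list first, so that the most anomalous nodes are used last. Note that the Mahalanobis distance is symmetric, so it flags unusually fast nodes as well as slow ones; a production version would need to check the direction of the deviation.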

Paper Structure

This paper contains 6 sections, 1 equation, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: This figure shows a map plot that attempts to preserve distances between the features of all items under consideration. Note the cluster labeled "3": it shows a markedly different shape from the clusters labeled "1" with large outliers. The x and y axes represent distances in an arbitrary space and are orthogonal.
  • Figure 2: This figure shows the performance distribution of HPL. Note that the distribution is similar to the blue line indicating a normal distribution.
  • Figure 3: This figure shows the performance distribution of the OpenMP DGEMM implementation. Note that this is not a single-peak distribution: three peaks are discernible. Consequently, it may be expected to segregate node performance better than an algorithm that produces a single-peak distribution.
  • Figure 4: This figure shows the performance distribution of the OpenMP-based NBODY algorithm, which utilizes fused multiply-add instructions newly available for the KNL architecture. Note that the distribution has five peaks instead of just one.
  • Figure 5: This figure shows the performance distribution of an optimized memory write performance test. Note that there are four peaks in the distribution and that a long tail exists.
  • ...and 6 more figures