Optimizing Performance on Trinity Utilizing Machine Learning, Proxy Applications and Scheduling Priorities
Phil Romero
TL;DR
Problem: large-scale clusters such as Trinity are increasingly bottlenecked by slow nodes. Approach: combine fast proxy applications (MPI/OpenMP) with machine learning and multivariate outlier detection to identify underperforming nodes efficiently. Key contributions: a proxy-app suite that mimics HPL performance, a regression mapping from proxy results to HPL, Mahalanobis-distance-based outlier detection, and an exploration of neural-network regression and boosting. Findings: simple regression flagged about 12 of 33 outliers, Mahalanobis distance identified around 20, and neural networks provided an outlier probability, while Random Forests struggled due to extreme class imbalance. Significance: a practical, data-driven framework to guide scheduling and maintenance, maximizing cluster efficiency and minimizing downtime.
Abstract
The number of nodes in today's supercomputers continues to increase; the first half of Trinity alone contains more than 9400 compute nodes. Since the speed of today's clusters is limited by the slowest nodes, it is more important than ever to identify slow nodes, improve their performance where possible, and ensure minimal use of slower nodes during performance-critical runs. This is an ongoing maintenance task that recurs on a regular basis, so it is important to minimize the impact on users by assessing and addressing slow nodes and mitigating their consequences while minimizing downtime. These issues can be solved, in large part, through the systematic application of fast-running hardware assessment tests, the application of machine learning, and the use of performance data to increase the efficiency of large clusters. Proxy applications utilizing both MPI and OpenMP were developed to produce data as a substitute for long-running applications when evaluating node performance. Machine learning is applied to identify underperforming nodes, and policies are being discussed both to minimize the impact of underperforming nodes and to increase the efficiency of the system. In this paper, I describe the process used to produce fast-running proxy tests, consider various methods to isolate the outliers, and produce ordered lists for use in scheduling.
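As a rough illustration of the outlier isolation and ordered-list steps, the following Python sketch scores each node by its squared Mahalanobis distance from the mean of the proxy-test results and sorts the nodes by that score. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the node identifiers, the layout of the metrics matrix (one row per node, one column per proxy metric), and the threshold are all hypothetical. Note that the distance alone does not distinguish unusually slow nodes from unusually fast ones, so flagged nodes would still need to be inspected.

```python
import numpy as np


def mahalanobis_sq(metrics):
    """Squared Mahalanobis distance of each row from the sample mean.

    metrics: array of shape (n_nodes, n_metrics), one row per node
    (hypothetical layout, e.g. one column per proxy-app timing).
    """
    X = np.asarray(metrics, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse tolerates a singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    # Row-wise quadratic form: centered[i] @ cov_inv @ centered[i]
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


def rank_nodes(node_ids, metrics, threshold):
    """Return (ranked, outliers).

    ranked:   node ids ordered from most typical to least typical.
    outliers: node ids whose squared distance exceeds `threshold`
              (an assumed cutoff chosen by the caller).
    """
    d2 = mahalanobis_sq(metrics)
    order = np.argsort(d2)
    ranked = [node_ids[i] for i in order]
    outliers = [node_ids[i] for i in order if d2[i] > threshold]
    return ranked, outliers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3))   # synthetic stand-in for proxy metrics
    data[17] += 10.0                   # one artificially deviant node
    ids = [f"node{i:04d}" for i in range(200)]
    ranked, outliers = rank_nodes(ids, data, threshold=16.27)
    print("least typical node:", ranked[-1])
    print("flagged:", outliers)
```

The threshold of 16.27 used in the demo is approximately the 0.999 quantile of a chi-square distribution with three degrees of freedom, which is a common choice when the metrics are roughly multivariate normal; a different cutoff may suit real proxy data better.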
