ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems
Yiheng Xu, Pranav Sivaraman, Hariharan Devarajan, Kathryn Mohror, Abhinav Bhatele
TL;DR
The paper addresses the challenge of choosing a storage sub-system for HPC I/O, in particular deciding when burst buffers help. It introduces PrismIO, a Python-based tool for detailed I/O trace analysis, and a per-file ML workflow that predicts the best sub-system. Using IOR-driven data from Lassen, the authors build a 19-feature dataset and train a classifier (best: Decision Tree) that achieves 94.47% accuracy on unseen IOR configurations and 95.86% accuracy on four real applications. Key contributions include PrismIO's API suite for feature extraction and visualization, a large-scale I/O characterization dataset, and an end-to-end workflow that outputs per-file placement decisions. The approach enables automated, scalable optimization of burst buffer usage, and could be extended to other platforms and to faster feature extraction for large-scale runs.
Abstract
Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining whether a given application could be accelerated using burst buffers is not straightforward, even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume and the processes involved) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selection. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and use the collected performance data to train the model. Our model can predict whether a file of an application should be placed on burst buffers with an accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.
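The per-file placement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use PrismIO's API: the `IorRun` record, the two features, and all numbers are hypothetical and made up for illustration, and a depth-1 decision tree (a "stump") written in plain Python stands in for the full classifier. Each run is labeled by whichever storage sub-system delivered higher bandwidth, and the stump then learns a single threshold on one feature.

```python
from dataclasses import dataclass


@dataclass
class IorRun:
    # Hypothetical per-file record; the actual feature set is not listed in the abstract.
    io_volume_mb: float
    num_procs: int
    bb_bandwidth: float   # bandwidth observed on the burst buffer (MB/s)
    pfs_bandwidth: float  # bandwidth observed on the parallel file system (MB/s)


def label(run):
    """Return 1 if the burst buffer was faster for this run, else 0."""
    return 1 if run.bb_bandwidth > run.pfs_bandwidth else 0


def fit_stump(rows, labels):
    """Fit a depth-1 decision tree by exhaustive search.

    Tries every feature, every midpoint between consecutive distinct values,
    and both label orientations; keeps the split with the fewest training errors.
    """
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for left_label in (0, 1):
                errors = sum(
                    (left_label if r[f] <= t else 1 - left_label) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    _, f, t, left_label = best
    return f, t, left_label


def predict(stump, row):
    """Predict 1 (burst buffer) or 0 (parallel file system) for one feature row."""
    f, t, left_label = stump
    return left_label if row[f] <= t else 1 - left_label


# Made-up numbers for illustration only; these are NOT measurements from the paper.
runs = [
    IorRun(64, 4, 900.0, 400.0),
    IorRun(128, 8, 1100.0, 600.0),
    IorRun(4096, 256, 700.0, 1500.0),
    IorRun(8192, 512, 650.0, 2100.0),
]
rows = [(r.io_volume_mb, r.num_procs) for r in runs]
labels = [label(r) for r in runs]
stump = fit_stump(rows, labels)
accuracy = sum(predict(stump, r) == y for r, y in zip(rows, labels)) / len(rows)
```

With real data, the rows would come from features extracted from I/O traces, and accuracy would be measured on held-out configurations rather than on the training rows as done here. The trend in the toy data carries no meaning; it only makes the split easy to follow.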
