ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

Yiheng Xu; Pranav Sivaraman; Hariharan Devarajan; Kathryn Mohror; Abhinav Bhatele

TL;DR

The paper addresses the challenge of selecting storage sub-systems for HPC I/O, particularly burst buffers. It introduces PrismIO, a Python-based tool for detailed I/O trace analysis, and a per-file ML workflow that predicts the best sub-system for each file. Using IOR-driven data from Lassen, the authors build a 19-feature dataset and train a classifier (best: Decision Tree) that achieves 94.47% accuracy on unseen IOR configurations and 95.86% on four real applications. Key contributions include PrismIO’s API suite for feature extraction and visualization, a large-scale I/O characterization dataset, and an end-to-end workflow that outputs per-file placement decisions. The approach enables automated, scalable optimization of burst buffer usage, with potential to extend to other platforms and to speed up feature extraction for large-scale runs.

Abstract

Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers (BBs) can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining if a given application could be accelerated using burst buffers is not straightforward even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume, processes involved, etc.) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data. We use the data as the input for training the model. Our model can predict if a file of an application should be placed on BBs for unseen IOR scenarios with an accuracy of 94.47% and for four real applications with an accuracy of 95.86%.
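
The modeling idea in the abstract (label each file with the sub-system that performed best, then learn a rule from I/O characteristics to that label) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the numbers are synthetic, `FileRun`, `best_subsystem`, `fit_stump`, and `predict` are hypothetical names, and a single-feature threshold rule (a depth-1 decision tree) stands in for the paper's classifier. It is not PrismIO code and not the authors' model.

```python
from dataclasses import dataclass


@dataclass
class FileRun:
    """One file from one benchmark run (all values here are synthetic)."""
    transfer_kib: int   # hypothetical I/O feature: transfer size in KiB
    gpfs_mbps: float    # made-up bandwidth on the parallel file system
    bb_mbps: float      # made-up bandwidth on the burst buffer


def best_subsystem(run):
    """Label a file with the sub-system that gave the higher bandwidth."""
    return "BB" if run.bb_mbps > run.gpfs_mbps else "GPFS"


def fit_stump(xs, ys):
    """Fit a depth-1 decision tree: one threshold on one feature."""
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    classes = sorted(set(ys))
    best = None
    # Try every midpoint between neighbouring feature values as a threshold,
    # with every assignment of class labels to the two sides.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in classes:
            for above in classes:
                correct = sum(
                    (below if x <= threshold else above) == y
                    for x, y in zip(xs, ys)
                )
                if best is None or correct > best[0]:
                    best = (correct, threshold, below, above)
    return best[1:]  # (threshold, label at or below, label above)


def predict(stump, x):
    """Apply the fitted threshold rule to one feature value."""
    threshold, below, above = stump
    return below if x <= threshold else above


# Synthetic training data; the trend is invented for the example.
runs = [
    FileRun(4, 50.0, 400.0),
    FileRun(16, 120.0, 600.0),
    FileRun(64, 300.0, 900.0),
    FileRun(256, 800.0, 1000.0),
    FileRun(1024, 1500.0, 1100.0),
    FileRun(4096, 2000.0, 1200.0),
]
xs = [run.transfer_kib for run in runs]
ys = [best_subsystem(run) for run in runs]
stump = fit_stump(xs, ys)
print(stump)                 # (640.0, 'BB', 'GPFS')
print(predict(stump, 128))   # BB
```

According to the TL;DR, the actual workflow uses 19 features per file and a full decision tree; the sketch only shows the shape of the labeling and fitting steps on one made-up feature.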

Paper Structure (23 sections, 17 figures, 1 table)

Introduction
Background & related work
Burst buffers
I/O performance analysis tools
Data collection
HPC machine and storage system
The IOR benchmark and its configuration
Data processing
PrismIO: An I/O performance analysis tool
Data structure
API functions for analyzing I/O performance
Feature extraction API
Visualization API
Case studies
Extreme runs analysis
...and 8 more sections

Figures (17)

Figure 1: Comparison of I/O bandwidth for different transfer sizes when IOR uses Lassen GPFS versus burst buffers. Depending on the transfer size, BBs do not always achieve better I/O performance than GPFS.
Figure 2: I/O bandwidth of ranks 0–3 for each file they read (in a sample IOR trace). The function returns a DataFrame with a hierarchical index that shows the read bandwidth of ranks 0–3 for different files. It uses the filter option to report only read bandwidth. The function automatically selects the most readable unit for bandwidth, in this case GB/s.
Figure 3: Time spent in metadata operations by ranks 0–3 for each file they read (in a sample IOR trace). In addition to absolute time, the output also reports the percentage of total time spent by a process performing I/O.
Figure 4: A screenshot of a part of the DataFrame when using shared_files for a sample trace. It demonstrates how files are shared across processes. Users can easily see that some files are not shared while others are shared by 32 ranks.
Figure 5: A screenshot of a part of the DataFrame when using access_pattern for a sample trace. The function counts the number of accesses of each type. This trace has 2 random accesses on files and no other type of access. From these counts, users can see which kind of access dominates the run and thus determine the access pattern for files.
...and 12 more figures
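
The per-rank, per-file bandwidth table described in the Figure 2 caption can be approximated with a short standard-library sketch. This is an illustration under stated assumptions, not the PrismIO API: the record layout, the `io_bandwidth` and `with_readable_unit` names, the `op` filter parameter, and the sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trace records: (rank, file, operation, bytes moved, seconds).
records = [
    (0, "/p/data/a", "read", 2 * 10**9, 1.0),
    (0, "/p/data/a", "read", 2 * 10**9, 1.0),
    (0, "/p/data/a", "write", 10**9, 4.0),
    (1, "/p/data/b", "read", 3 * 10**9, 2.0),
]


def io_bandwidth(records, op=None):
    """Return bytes per second keyed by (rank, file), optionally for one op."""
    totals = defaultdict(lambda: [0, 0.0])  # (rank, file) -> [bytes, seconds]
    for rank, path, operation, nbytes, seconds in records:
        if op is not None and operation != op:
            continue  # the filter keeps only the requested operation type
        entry = totals[(rank, path)]
        entry[0] += nbytes
        entry[1] += seconds
    return {key: b / s for key, (b, s) in totals.items() if s > 0}


def with_readable_unit(bandwidths):
    """Scale all values to the largest unit that the peak value reaches."""
    units = [("GB/s", 1e9), ("MB/s", 1e6), ("KB/s", 1e3), ("B/s", 1.0)]
    peak = max(bandwidths.values(), default=0.0)
    name, scale = next(((n, s) for n, s in units if peak >= s), units[-1])
    return name, {key: value / scale for key, value in bandwidths.items()}


unit, table = with_readable_unit(io_bandwidth(records, op="read"))
for (rank, path), value in sorted(table.items()):
    print(rank, path, value, unit)
```

Keying the result by `(rank, file)` plays the role of the hierarchical index mentioned in the caption; a DataFrame-based tool would expose the same grouping as a two-level index.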
