Capturing Periodic I/O Using Frequency Techniques

Ahmad Tarraf, Alexis Bandet, Francieli Boito, Guillaume Pallez, Felix Wolf

TL;DR

The paper tackles the challenge of capturing periodic I/O bursts in HPC workloads to reduce file-system contention. It introduces FTIO, an online method that applies discrete Fourier transform analysis to application-level I/O bandwidth signals to extract the period of I/O phases, accompanied by confidence metrics and optional autocorrelation refinement. The approach achieves a mean error below $11\%$ on large-scale traces and, when paired with the I/O scheduler Set-10, raises system utilization by $26\%$ and reduces I/O slowdown by $56\%$. FTIO is lightweight, usable both online and offline, and open-source, with applications ranging from I/O scheduling to burst-buffer management.

Abstract

Many HPC applications perform their I/O in bursts that follow a periodic pattern. This allows for making predictions as to when a burst occurs. System providers can take advantage of such knowledge to reduce file-system contention by actively scheduling I/O bandwidth. The effectiveness of this approach, however, depends on the ability to detect and quantify the periodicity of I/O patterns online. In this paper, we introduce FTIO, an online method to detect periodic I/O phases, which is based on discrete Fourier transform (DFT), combined with outlier detection. We provide metrics that gauge the confidence in the output and tell how far from being periodic the signal is. We validate our approach with large-scale experiments on a production system and examine its limitations extensively. Our experiments show that FTIO has a mean error below 11%. Finally, we demonstrate that FTIO allowed the I/O scheduler Set-10 to boost system utilization by 26% and reduce I/O slowdown by 56%.
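To make the idea in the abstract concrete, the following is a minimal sketch of DFT-based period detection on a sampled bandwidth signal. It is not FTIO's implementation: the function names (`dft_power`, `dominant_period`), the Z-score test used as the outlier detector, and the threshold of 3 are all assumptions chosen for illustration, and the synthetic trace is made up. The DFT is computed naively with the standard library to keep the example self-contained.

```python
import cmath
import math


def dft_power(signal):
    """One-sided power spectrum |X_k|^2 for k = 0..N//2 (naive O(N^2) DFT)."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(signal))
        power.append(abs(x_k) ** 2)
    return power


def dominant_period(signal, sampling_rate, z_threshold=3.0):
    """Estimate the period in seconds, or None if no frequency bin is an outlier.

    Assumption for illustration: a bin counts as an outlier when its power
    has a Z-score of at least z_threshold within the spectrum.
    """
    n = len(signal)
    power = dft_power(signal)[1:]  # drop k = 0, which only holds the mean
    mean = sum(power) / len(power)
    std = math.sqrt(sum((p - mean) ** 2 for p in power) / len(power))
    if std == 0:
        return None
    best = max(range(len(power)), key=power.__getitem__)
    if (power[best] - mean) / std < z_threshold:
        return None
    freq_hz = (best + 1) * sampling_rate / n  # bin index -> frequency
    return 1.0 / freq_hz


# Synthetic trace (hypothetical): a 2 s burst every 10 s, sampled at 1 Hz for 100 s.
trace = [1.0 if t % 10 < 2 else 0.0 for t in range(100)]
period = dominant_period(trace, sampling_rate=1.0)
print(f"estimated period: {period} s")
```

On this synthetic trace the strongest non-DC bin sits at 0.1 Hz, so the estimate is a period of about 10 s. A real trace would be noisier, which is where the confidence metrics mentioned in the abstract matter.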

Paper Structure

This paper contains 22 sections, 10 equations, 17 figures.

Figures (17)

  • Figure 1: Difficulty of detecting I/O phases: Where does A finish? Is B one or two phases? Why don't A and B belong together?
  • Figure 2: FTIO results on IOR with 9216 ranks executed on the Lichtenberg cluster. The time behavior (top) and the normed power spectrum (bottom) are shown.
  • Figure 3: Result of autocorrelation on IOR with 9216 ranks.
  • Figure 4: A red line marks the $V(\mathcal{T})/L(\mathcal{T})$ threshold in the trace from Figure \ref{fig:exampleIO}. Here $R_{IO} = 0.68$ and $B_{IO} \approx 11$ GB/s.
  • Figure 5: Online period prediction using FTIO.
  • ...and 12 more figures