Mitigating Shared Storage Congestion Using Control Theory

Thomas Collignon, Kouds Halitim, Raphaël Bleuse, Sophie Cerf, Bogdan Robu, Éric Rutten, Lionel Seinturier, Alexandre van Kempen

TL;DR

The paper tackles unpredictable I/O performance in shared HPC storage by proposing an end-to-end feedback-control framework that dynamically regulates client-side I/O rates. It defines a practical sensor (dispatch queue depth, $q$) and an actuator (client-side bandwidth throttling via tc, $bw$), derives a simple first-order model $q(k+1) = a \cdot q(k) + b \cdot bw(k)$, and implements a discrete-time PI controller $bw(k) = K_p e(k) + K_i T_s \sum_{j=0}^{k} e(j)$, where $e(k)$ is the tracking error at step $k$ and the sampling period is $T_s = 300\,\mathrm{ms}$. In Grid'5000 experiments with a write-intensive workload, the controller converges stably to its targets, reduces total runtime by up to 20%, and improves tail latency by up to 35% for certain control targets, demonstrating the viability of control theory for stabilizing I/O in HPC. The work suggests that end-to-end, workload-agnostic congestion mitigation can be portable across architectures, and it lays out future directions including noise filtering, workload adaptivity, and distributed/multi-controller designs.
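To make the model and control law above concrete, the following is a minimal simulation sketch rather than the authors' implementation. It iterates the first-order model together with the PI law in pure Python. Only $T_s = 300\,\mathrm{ms}$ comes from the summary. The values of `a`, `b`, `kp`, `ki`, and the target are hypothetical and chosen only so that the closed loop is stable. The error definition $e(k) = \text{target} - q(k)$ is also an assumption, since the summary does not state it, and actuator limits (a bandwidth cap or a non-negative floor) are deliberately left out.

```python
def simulate_pi(a, b, kp, ki, ts, target, steps, q0=0.0):
    """Simulate q(k+1) = a*q(k) + b*bw(k) under a discrete-time PI controller.

    The error is assumed to be e(k) = target - q(k); this sign convention
    is an illustration choice, not taken from the paper.
    """
    q = q0
    error_sum = 0.0          # running sum of e(j) for j = 0..k
    queue, bandwidth = [], []
    for _ in range(steps):
        error = target - q
        error_sum += error
        # PI law: bw(k) = Kp*e(k) + Ki*Ts*sum_{j<=k} e(j)
        bw = kp * error + ki * ts * error_sum
        queue.append(q)
        bandwidth.append(bw)
        # First-order plant model: next queue depth from current depth and input
        q = a * q + b * bw
    return queue, bandwidth


# Hypothetical plant and gains (NOT identified from real measurements).
queue, bandwidth = simulate_pi(
    a=0.8, b=0.5, kp=0.4, ki=1.0, ts=0.3, target=10.0, steps=200
)
print(f"final queue depth: {queue[-1]:.4f}")
print(f"final bandwidth input: {bandwidth[-1]:.4f}")
```

With these illustrative values the integral term removes the steady-state error: the queue depth settles at the target, and the input settles at the value that holds it there, $bw = q\,(1-a)/b$. In a real deployment `a` and `b` would come from open-loop identification experiments, and the gains would be tuned against the identified model.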

Abstract

Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.

Paper Structure

This paper contains 23 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of the control on the I/O path on computing cluster
  • Figure 2: Graphical explanation of the different properties of a controlled system (stability, settling time, overshoot)
  • Figure 3: Open-Loop Experiments used for Identification -- No Feedback or Control: The upper plot illustrates the system's static behavior under fixed values of input, while the lower plot depicts its dynamic response to input varying during run-time.
  • Figure 4: Control results. The top plot shows the dispatch queue size (raw and rolling average) responding to control target changes over time. The green line is the system's average under each fixed control target, showing that the controller reaches the desired target on average. The bottom plot shows the bandwidth limit chosen by the controller over time.
  • Figure 5: Control results with multiple control gain configurations, showing the impact of the controller gains on the quality of the control.
  • ...and 3 more figures