Mitigating Shared Storage Congestion Using Control Theory
Thomas Collignon, Kouds Halitim, Raphaël Bleuse, Sophie Cerf, Bogdan Robu, Éric Rutten, Lionel Seinturier, Alexandre van Kempen
TL;DR
The paper tackles unpredictable I/O performance in shared HPC storage by proposing an end-to-end feedback-control framework that dynamically regulates client-side I/O rates. It defines a practical sensor (dispatch queue depth, $q$) and an actuator (client-side bandwidth throttling, $bw$, applied with the Linux tc traffic-control tool), derives a simple first-order model $q(k+1) = a \cdot q(k) + b \cdot bw(k)$, and implements a discrete-time PI controller $bw(k) = K_p e(k) + K_i T_s \sum_{j=0}^{k} e(j)$, where $e(k)$ is the tracking error at step $k$ and the sampling period is $T_s = 300\,\mathrm{ms}$. In Grid'5000 experiments with a write-intensive workload, the controlled system converges stably, reducing total runtime by up to 20% and improving tail latency by up to 35% for certain control targets, which demonstrates the viability of control theory for stabilizing I/O in HPC. The work suggests that end-to-end, workload-agnostic congestion mitigation can be portable across architectures, and it outlines future directions including noise filtering, workload adaptivity, and distributed or multi-controller designs.
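To make the control law concrete, the following minimal Python sketch simulates the first-order model in closed loop with the discrete-time PI controller. It is an illustration under stated assumptions, not the authors' implementation: the error is assumed to be $e(k) = q_{\mathrm{target}} - q(k)$, the values of $a$, $b$, $K_p$ and $K_i$ are invented for the example (the summary does not give the identified values), and the names `PIController` and `simulate` are hypothetical. Actuator limits, such as bandwidth being non-negative and bounded by the link, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete-time PI law: bw(k) = Kp*e(k) + Ki*Ts*sum_{j<=k} e(j)."""
    kp: float
    ki: float
    ts: float               # sampling period in seconds (0.3 s in the paper)
    error_sum: float = 0.0  # running sum of past errors

    def step(self, target: float, measured: float) -> float:
        # Assumed sign convention: error = target - measurement.
        error = target - measured
        self.error_sum += error
        return self.kp * error + self.ki * self.ts * self.error_sum


def simulate(a, b, controller, target, steps, q0=0.0):
    """Run the model q(k+1) = a*q(k) + b*bw(k) under feedback."""
    q = q0
    trace = []
    for _ in range(steps):
        bw = controller.step(target, q)  # actuator command for this period
        q = a * q + b * bw               # first-order plant update
        trace.append(q)
    return trace


if __name__ == "__main__":
    # Illustrative parameters only; not taken from the paper.
    ctrl = PIController(kp=0.2, ki=2.0, ts=0.3)
    trace = simulate(a=0.5, b=0.5, controller=ctrl, target=10.0, steps=200)
    print(f"final queue depth: {trace[-1]:.6f}")
```

With these illustrative parameters the closed loop is stable, and the integral term drives the simulated queue depth to the target with no steady-state error. A real deployment would also need to clamp the command to the feasible bandwidth range and handle the resulting integrator wind-up.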
Abstract
Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.
