CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems

Md Hasanur Rashid, Nathan R. Tallent, Forrest Sheng Bao, Dong Dai

TL;DR

This work presents CARAT, an ML-guided framework that co-tunes client-side RPC and caching parameters of parallel file systems (PFS) using only locally observable metrics. The authors believe it could be widely deployed in existing PFS and benefit a range of data-intensive applications.

Abstract

Tuning parallel file systems (PFS) in High-Performance Computing (HPC) systems remains challenging due to complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework that co-tunes client-side RPC and caching parameters of PFS using only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent and intelligent tuning decisions online, responding to real-time changes in both application I/O behavior and system state. We prototyped CARAT on Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results demonstrate that CARAT can achieve up to 3x performance improvement over default or static configurations, validating the effectiveness and generality of our approach. Because it is scalable and lightweight, we believe CARAT has the potential to be widely deployed in existing PFS and to benefit various data-intensive applications.

Paper Structure

This paper contains 25 sections, 8 figures, 8 tables, and 2 algorithms.

Figures (8)

  • Figure 1: The I/O clients serve individual I/O requests.
  • Figure 2: (a) The overall architecture and (b) the detailed I/O path of Lustre.
  • Figure 3: An illustration of how I/O requests are split into pages and aggregated into fixed-size RPC extents.
  • Figure 4: Architecture of the CARAT framework.
  • Figure 5: The two-stage tuning strategy in CARAT.
  • ...and 3 more figures
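
As a rough illustration of the idea named in the Figure 3 caption (I/O requests split into pages and aggregated into fixed-size RPC extents), the arithmetic can be sketched as follows. This is a minimal sketch and not the paper's or Lustre's implementation: the function name `split_into_extents`, the `Extent` type, the 4 KiB page size, and the 256-pages-per-extent value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096       # assumed page size in bytes (illustrative)
PAGES_PER_RPC = 256    # assumed extent size in pages, i.e. 1 MiB (illustrative)


@dataclass(frozen=True)
class Extent:
    index: int         # which fixed-size extent of the file this is
    first_page: int    # first file-wide page number touched in this extent
    last_page: int     # last file-wide page number touched (inclusive)


def split_into_extents(offset, length,
                       page_size=PAGE_SIZE, pages_per_rpc=PAGES_PER_RPC):
    """Map a byte range onto pages, then group the pages by fixed-size extent."""
    if length <= 0:
        return []
    first = offset // page_size                  # first page touched
    last = (offset + length - 1) // page_size    # last page touched (inclusive)
    extents = []
    page = first
    while page <= last:
        ext_index = page // pages_per_rpc
        # An extent ends at its own boundary or at the end of the request.
        ext_last = min(last, (ext_index + 1) * pages_per_rpc - 1)
        extents.append(Extent(ext_index, page, ext_last))
        page = ext_last + 1
    return extents


if __name__ == "__main__":
    # A 200,000-byte request starting at byte offset 1,000,000.
    for ext in split_into_extents(1_000_000, 200_000):
        print(ext)
```

Under these assumed sizes, a request that crosses an extent boundary is carried by more than one extent, which is one reason the extent size interacts with the request pattern a client observes.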