DIAL: Decentralized I/O AutoTuning via Learned Client-side Local Metrics for Parallel File System

Md Hasanur Rashid, Xinyi Li, Youbiao He, Forrest Sheng Bao, Dong Dai

TL;DR

DIAL (Decentralized I/O AutoTuning via Learned Client-side Local Metrics) takes a decentralized approach, treating each I/O client as an independent unit and tuning configurations using only its locally observable metrics.

Abstract

Enabling efficient, high-performance data access in parallel file systems (PFS) is critical for today's high-performance computing systems. PFS client-side I/O heavily impacts the final I/O performance delivered to individual applications and to the entire system. Autotuning the key client-side I/O behaviors has been extensively studied and shows promising results. However, existing work has relied heavily on monitoring a large number of global runtime metrics and on accurately modeling applications' I/O patterns. Such heavy overheads significantly limit the ability to enable fine-grained, dynamic tuning in practical systems. In this study, we propose DIAL (Decentralized I/O AutoTuning via Learned Client-side Local Metrics), which takes a drastically different approach. Instead of trying to extract the global I/O patterns of applications, DIAL treats each I/O client as an independent unit and tunes its configuration using only its locally observable metrics. With the help of machine learning models, DIAL enables multiple tunable units to make independent but collective decisions, reacting in a timely manner to what is happening in the global storage system and achieving better I/O performance globally for the application.
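
The decentralized idea in the abstract can be illustrated with a minimal sketch, which is not DIAL's actual implementation. Everything here is an assumption made for illustration: the metric names (`seq_ratio`, `rpc_latency_ms`), the candidate values, and `toy_predict`, a hand-written stand-in for a learned model. The two tunables are named after the Lustre client parameters `max_rpcs_in_flight` and `max_pages_per_rpc`, but the abstract does not say which parameters DIAL tunes.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical config: (max_rpcs_in_flight, max_pages_per_rpc).
Config = Tuple[int, int]
Metrics = Dict[str, float]


@dataclass
class LocalTuner:
    """One tuner per I/O client; it sees only that client's local metrics."""
    predict: Callable[[Metrics, Config], float]  # stand-in for a learned model
    candidates: List[Config]

    def choose(self, local_metrics: Metrics) -> Config:
        # Pick the candidate with the highest predicted performance score.
        return max(self.candidates,
                   key=lambda cfg: self.predict(local_metrics, cfg))


def toy_predict(metrics: Metrics, config: Config) -> float:
    """Toy scoring rule (an assumption, not a trained model).

    Sequential access rewards larger RPCs; high observed RPC latency
    penalizes keeping many RPCs in flight.
    """
    rpcs, pages = config
    return ((metrics["seq_ratio"] - 0.5) * pages
            + (10.0 - metrics["rpc_latency_ms"]) * rpcs)


candidates = list(product([8, 32], [256, 1024]))
tuner = LocalTuner(predict=toy_predict, candidates=candidates)

# Two clients observe different local conditions and decide independently.
client_a = {"seq_ratio": 0.9, "rpc_latency_ms": 2.0}   # sequential, fast servers
client_b = {"seq_ratio": 0.1, "rpc_latency_ms": 40.0}  # random, congested servers

choice_a = tuner.choose(client_a)
choice_b = tuner.choose(client_b)
print(choice_a, choice_b)
```

No client exchanges state with any other client or with a central coordinator. Each decision depends only on what that client can observe, which is the property the abstract emphasizes.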

Paper Structure (14 sections, 3 figures, 3 tables, 1 algorithm)

Figures (3)

  • Figure 1: The detailed I/O path of Lustre.
  • Figure 2: Architecture of the DIAL framework.
  • Figure 3: Deep learning application executions.