Table of Contents
Fetching ...

IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

Chris Egersdoerfer, Arnav Sareen, Jean Luca Bez, Suren Byna, Dongkuan, Xu, Dong Dai

TL;DR

OAgent integrates various new designs, including a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file, and has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.

Abstract

As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. However, key challenges remain, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions.In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Similar to an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface for scientists to ask targeted follow-up questions. To evaluate IOAgent, we collected a diverse set of labeled job traces and released the first open diagnosis test suite, TraceBench. Using this test suite, we conducted extensive evaluations, demonstrating that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools with accurate and useful diagnosis results. We also show that IOAgent is not tied to specific LLMs, performing similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.

IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

TL;DR

OAgent integrates various new designs, including a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file, and has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.

Abstract

As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. However, key challenges remain, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions.In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Similar to an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface for scientists to ask targeted follow-up questions. To evaluate IOAgent, we collected a diverse set of labeled job traces and released the first open diagnosis test suite, TraceBench. Using this test suite, we conducted extensive evaluations, demonstrating that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools with accurate and useful diagnosis results. We also show that IOAgent is not tied to specific LLMs, performing similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.
Paper Structure (22 sections, 2 equations, 6 figures, 4 tables)

This paper contains 22 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An example diagnosis done by directly querying two LLMs: gpt-4 and gpt-4o.
  • Figure 2: The overall workflow of IOAgent and its key components.
  • Figure 3: An example prompt and response transforming JSON summary fragment to natural language
  • Figure 4: A, B and C denote three augmentations applied.
  • Figure 5: Post-diagnosis example between user and IOAgent.
  • ...and 1 more figures