STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems

Chris Egersdoerfer; Philip Carns; Shane Snyder; Robert Ross; Dong Dai

TL;DR

STELLAR, an autonomous tuner for high-performance parallel file systems, integrates retrieval-augmented generation, external tool execution, LLM-based reasoning, and a multiagent design to stabilize reasoning and combat hallucinations, thus providing insight into the design of similar systems for other optimization problems.

Abstract

I/O performance is crucial to efficiency in data-intensive scientific computing, but tuning large-scale storage systems is complex, costly, and notoriously manpower-intensive, making it inaccessible for most domain scientists. To address this problem, we propose STELLAR, an autonomous tuner for high-performance parallel file systems. Our evaluations show that STELLAR almost always selects near-optimal parameter configurations for parallel file systems within the first five attempts, even for previously unseen applications. STELLAR differs fundamentally from traditional autotuning methods, which often require hundreds of thousands of iterations to converge. Powered by large language models (LLMs), STELLAR enables autonomous end-to-end agentic tuning by (1) accurately extracting tunable parameters from software manuals, (2) analyzing I/O trace logs generated by applications, (3) selecting initial tuning strategies, (4) rerunning applications on real systems and collecting I/O performance feedback, (5) adjusting tuning strategies and repeating the tuning cycle, and (6) reflecting on and summarizing tuning experiences into reusable knowledge for future optimizations. STELLAR integrates retrieval-augmented generation (RAG), tool execution, LLM-based reasoning, and a multiagent design to stabilize reasoning and combat hallucinations. We evaluate the impact of each component on optimization outcomes, providing design insights for similar systems in other optimization domains. STELLAR's architecture and empirical results highlight a promising approach to complex system optimization, especially for problems with large search spaces and high exploration costs, while making I/O tuning more accessible to domain scientists with minimal added resources.
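The feedback-driven part of the cycle described above (steps 3 to 5: pick an initial configuration, rerun the application, adjust, repeat) can be illustrated with a minimal sketch. This is not STELLAR's implementation. The names `tune`, `toy_run`, and `toy_propose`, the parameter names `stripe_count` and `stripe_size_mb`, and the synthetic runtime formula are all hypothetical, chosen only for illustration. In a system like the one the abstract describes, the proposal step would be an LLM agent reasoning over I/O traces and the run step would execute the application on a real file system. Parameter extraction, trace analysis, and rule summarization (steps 1, 2, and 6) are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
History = List[Tuple[Config, float]]


@dataclass
class TuningResult:
    best_config: Config
    best_runtime: float
    history: History = field(default_factory=list)


def tune(
    run_application: Callable[[Config], float],
    propose: Callable[[History], Config],
    initial: Config,
    max_attempts: int = 5,
) -> TuningResult:
    """Generic measure-and-adjust loop; smaller runtime is better."""
    # Baseline measurement with the initial configuration.
    best_config = dict(initial)
    best_runtime = run_application(initial)
    history: History = [(dict(initial), best_runtime)]

    for _ in range(max_attempts):
        # Hypothetical stand-in for the agent's reasoning step.
        candidate = propose(history)
        # Hypothetical stand-in for rerunning the application.
        runtime = run_application(candidate)
        history.append((dict(candidate), runtime))
        if runtime < best_runtime:
            best_config, best_runtime = dict(candidate), runtime

    return TuningResult(best_config, best_runtime, history)


def toy_run(cfg: Config) -> float:
    """Synthetic runtime model (an assumption, not measured data)."""
    return 100.0 / min(cfg["stripe_count"], 8) + abs(cfg["stripe_size_mb"] - 4)


def toy_propose(history: History) -> Config:
    """Naive heuristic: double both parameters of the best configuration so far."""
    best_cfg, _ = min(history, key=lambda item: item[1])
    return {
        "stripe_count": best_cfg["stripe_count"] * 2,
        "stripe_size_mb": best_cfg["stripe_size_mb"] * 2,
    }


if __name__ == "__main__":
    result = tune(toy_run, toy_propose, {"stripe_count": 1, "stripe_size_mb": 1})
    print(result.best_config, result.best_runtime)
```

The loop keeps the full history so that the proposal function can condition on every earlier attempt, which is the property an agent needs in order to adjust its strategy between runs.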

Paper Structure (40 sections, 10 figures)

Introduction
Background
HPC Parallel File System Tuning
Tunable Parameter Importance
I/O Patterns and Profiling
Large Language Models
Workflows and Agents
Tool Usage
Self-Learning
Hallucination Issues of LLM Agents
Related Work
Autotuning for HPC Parallel File Systems
LLM-Based Database Tuning
Design and Implementation
Overall Workflow
...and 25 more sections

Figures (10)

Figure 1: STELLAR design overview. The four numbered elements represent the four key modules in STELLAR.
Figure 2: Example of LLM hallucinations for storage system parameter details. We also show the RAG-based extraction result of STELLAR on the same parameter. Note that our RAG-based extraction leverages the older GPT-4o model.
Figure 3: Example of decision-making via interactions between the Analysis Agent and Tuning Agent.
Figure 4: Example of generated tuning rule.
Figure 5: Comparison of STELLAR's tuning performance with default and human expert baselines. Smaller values are better.
...and 5 more figures
