LLMs as Packagers of HPC Software

Caetano Melone, Daniel Nichols, Konstantinos Parasyris, Todd Gamblin, Harshitha Menon

TL;DR

This work addresses the challenge of automating Spack recipe generation for HPC software, a domain characterized by heterogeneous build systems and complex dependency graphs. It introduces SpackIt, an agentic framework that couples repository analysis, retrieval of relevant examples, and an iterative self-repair loop guided by diagnostic feedback to generate valid Spack recipes. Through a large-scale study on 308 E4S CMake-based HPC packages, SpackIt raises installation success from 19.7% in zero-shot settings to approximately 83% in the best configuration, demonstrating the value of retrieval-augmented context and structured feedback for reliable package synthesis. The approach advances reproducibility and efficiency in HPC software packaging by grounding model reasoning in repository metadata and domain-specific conventions, and it provides a replication package to support further research.

Abstract

High performance computing (HPC) software ecosystems are inherently heterogeneous, comprising scientific applications that depend on hundreds of external packages, each with distinct build systems, options, and dependency constraints. Tools such as Spack automate dependency resolution and environment management, but their effectiveness relies on manually written build recipes. As these ecosystems grow, maintaining existing specifications and creating new ones becomes increasingly labor-intensive. While large language models (LLMs) have shown promise in code generation, automatically producing correct and maintainable Spack recipes remains a significant challenge. We present a systematic analysis of how LLMs and context-augmentation methods can assist in the generation of Spack recipes. To this end, we introduce SpackIt, an end-to-end framework that combines repository analysis, retrieval of relevant examples, and iterative refinement through diagnostic feedback. We apply SpackIt to a representative subset of 308 open-source HPC packages to assess its effectiveness and limitations. Our results show that SpackIt increases installation success from 20% in a zero-shot setting to over 80% in its best configuration, demonstrating the value of retrieval and structured feedback for reliable package synthesis.
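The generate, validate, and repair cycle described above can be illustrated with a minimal sketch. This is not SpackIt's implementation: the names `generate_with_repair`, `BuildResult`, and the three callables are hypothetical, and the toy stand-ins below replace the LLM calls and the real Spack build with simple string checks. It only shows the control flow, in which one initial generation is followed by at most `max_repairs` repairs, each driven by the diagnostic log of the previous failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BuildResult:
    """Outcome of one validation attempt (hypothetical structure)."""
    success: bool
    log: str  # diagnostic output that is fed back into the repair step


def generate_with_repair(
    generate: Callable[[str], str],
    repair: Callable[[str, str], str],
    validate: Callable[[str], BuildResult],
    context: str,
    max_repairs: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (recipe, repairs_used) on success, else (None, max_repairs)."""
    recipe = generate(context)  # initial attempt, no feedback yet
    for used in range(max_repairs + 1):
        result = validate(recipe)
        if result.success:
            return recipe, used
        if used < max_repairs:
            # Feed the failure log back so the next attempt can address it.
            recipe = repair(recipe, result.log)
    return None, max_repairs


# Toy stand-ins (assumptions for illustration only, not real LLM or Spack calls).
def toy_generate(context: str) -> str:
    return "class Demo(CMakePackage): pass"


def toy_validate(recipe: str) -> BuildResult:
    if 'depends_on("cmake"' in recipe:
        return BuildResult(True, "")
    return BuildResult(False, "error: missing build dependency: cmake")


def toy_repair(recipe: str, log: str) -> str:
    if "cmake" in log:
        return recipe.replace("pass", 'depends_on("cmake", type="build")')
    return recipe


if __name__ == "__main__":
    recipe, used = generate_with_repair(
        toy_generate, toy_repair, toy_validate, context="repository metadata"
    )
    print(used, recipe)
```

In a real setting the `validate` callable would run the install in an isolated environment and return its output, and the loop budget would correspond to the number of repair attempts reported in the figures.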

Paper Structure

This paper contains 35 sections, 4 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Example of a Spack package defining a software build specification. Each recipe is implemented as a Python class that declares versions, build variants, and dependencies. This structure enables configurability and reproducibility across installation targets.
  • Figure 2: Overview of our approach. The pipeline is coordinated by a single agent that orchestrates specialized tools for extraction, retrieval, generation, evaluation, and repair. It begins by analyzing a repository to collect build metadata and directory structure. Examples from either graph-based or embedding-based retrieval tools are combined with this metadata to form the context for generation. The resulting recipe is then validated in an isolated build environment. If failures occur, the error-aware repair loop leverages feedback from Spack outputs to iteratively refine the package for up to $k$ attempts.
  • Figure 3: An example Spack recipe that installs successfully, but omits essential details such as optional dependencies, debug/release flags, and version conflicts.
  • Figure 4: Cumulative installation success for GPT-5 over five repair attempts, grouped by reference type. All configurations improve with the agentic repair loop, with two similar references achieving the highest success rate (82.9%).
  • Figure 5: Cumulative installation success for GPT-5 with one similar reference across up to 30 repair attempts. Success rises rapidly through the first iterations, reaching 74% by the fifth and plateauing near 83% after 15 attempts.
  • ...and 7 more figures
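
Figures 4 and 5 report cumulative installation success over repair attempts. As a rough sketch of how such a curve can be computed, assume each package is recorded with the attempt index at which it first installed (0 for the initial generation), or `None` if it never installed within the budget. The function name `cumulative_success` and the five-package input are hypothetical and are not data from the paper.

```python
from typing import List, Optional, Sequence


def cumulative_success(
    first_success: Sequence[Optional[int]], max_attempts: int
) -> List[float]:
    """Fraction of packages installed after 0..max_attempts repair attempts.

    first_success[i] is the attempt index at which package i first installed
    (0 means the initial generation succeeded), or None if it never did.
    """
    total = len(first_success)
    if total == 0:
        raise ValueError("need at least one package")
    rates = []
    for k in range(max_attempts + 1):
        # A package counts as solved at step k if it succeeded at or before k.
        solved = sum(1 for a in first_success if a is not None and a <= k)
        rates.append(solved / total)
    return rates


if __name__ == "__main__":
    # Hypothetical toy data: five packages, one of which never installs.
    toy = [0, 0, 1, 3, None]
    print(cumulative_success(toy, max_attempts=3))
```

Because a package stays solved once it has installed, the resulting curve never decreases, which matches the rise-then-plateau shape described in the captions.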