HPC-Coder: Modeling Parallel Programs using Large Language Models

Daniel Nichols; Aniruddha Marathe; Harshitha Menon; Todd Gamblin; Abhinav Bhatele

TL;DR

The paper tackles the difficulty of modeling performance and parallel behavior in HPC code by fine-tuning an HPC-focused LLM, HPC-Coder, on a large HPC code dataset. It demonstrates three downstream capabilities: HPC code generation, OpenMP pragma labeling, and relative performance prediction, achieving superior results on HPC-specific tasks compared to baseline models. Key contributions include a large HPC/source-code dataset, the HPC-Coder model with strong code-generation and pragma-labeling performance, and the ability to predict relative code performance with high accuracy using limited data. This work showcases the potential of domain-specific LLMs to automate HPC development and provide performance-aware insights, enabling faster, more reliable code optimization and deployment in exascale-era workloads.

Abstract

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models makes developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers, increasing their productivity and decreasing the chance of error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling and the availability of large amounts of open-source, code-related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.

Paper Structure (26 sections, 9 equations, 12 figures, 4 tables)

Introduction
Background
Large Language Models
Text Generation
Using LLMs for Code Generation
Overview of the Proposed Methodology
Data Gathering and Pre-processing
HPC Source Code Data
Data Pre-processing
Performance Datasets
Fine-tuning Methodology
Models Selected For Fine-tuning
Fine-tuning Setup and Hyperparameters
Downstream Inference Tasks and Evaluation Metrics
Code Completion
...and 11 more sections

Figures (12)

Figure 1: Overview of the steps described in this paper to train an HPC specific model and run it on several downstream tasks. After collecting a large dataset of HPC code we fine-tune several pre-trained language models and select the best one. The selected model is then used to generate code, label OpenMP pragmas, and predict relative performance as part of several downstream tasks.
Figure 2: Distribution of the number of lines of code in each file type. .cxx, .hh, .H, and .hxx files are included in the dataset, but omitted here due to small counts.
Figure 3: An example prompt asking the model to generate a parallel version of saxpy. The comment and function header make up the prompt. The function body on the bottom shows a potential model output.
Figure 4: Downstream evaluation performance across training iterations for PolyCoder+HPC. The model starts to perform worse around 45,000 samples even though the perplexity keeps improving.
Figure 5: Comparison of models on code generation. The clusters represent the average pass@k scores for $k = 1$, $10$, and $100$. Higher percentage is better.
...and 7 more figures
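
The prompt layout described in the Figure 3 caption (a comment plus a function header, followed by a model-generated body) can be illustrated with a minimal sketch. The exact prompt text and signature used in the paper are not reproduced here; the comment wording, the `std::vector` signature, and the loop body below are assumptions chosen only to show the shape of the task.

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical prompt (assumed wording, not taken from the paper):
   compute y = a * x + y in parallel using OpenMP.
   x and y are expected to have the same length. */
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    // Everything above this line plays the role of the prompt.
    // The body below is one plausible completion, not the paper's output.
    const long n = static_cast<long>(x.size());

    // Each iteration touches a distinct index, so the loop can be
    // split across threads without synchronization.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the code is built without OpenMP support (for example, without `-fopenmp` on GCC or Clang), the pragma is ignored and the loop runs serially with the same result.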
