HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Aman Chaturvedi, Daniel Nichols, Siddharth Singh, Abhinav Bhatele

TL;DR

The paper addresses the challenge of generating correct parallel code with LLMs by creating a large synthetic HPC dataset (HPC-Instruct) and systematically studying data, model, and prompt factors. It shows that fine-tuning smaller base models on high-quality HPC data yields strong parallel-code generation, with data quality and model size being critical drivers of performance. The result is HPC-Coder-V2, an open-source family that delivers state-of-the-art parallel code generation performance with favorable speed and memory characteristics, approaching GPT-scale capabilities on parallel tasks. This work provides practical guidelines for building HPC-aware code LLMs and supplies datasets and models that can accelerate future HPC AI developer tooling.

Abstract

Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing (HPC). Creating specialized models and tools for these domains is crucial to gaining the benefits of LLMs in areas such as HPC. While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code, and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. Based on our findings, we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date.

Paper Structure

This paper contains 32 sections, 1 equation, 13 figures, 1 table.

Figures (13)

  • Figure 1: Overview of the methodology proposed in this paper. First, we use open-source parallel code snippets to generate a large synthetic instruction dataset of parallel code samples. We then conduct ablation studies to understand how data, model, and fine-tuning parameters impact the capability of a code LLM to write parallel code. Finally, we utilize the dataset and insights from the ablation studies to fine-tune a code LLM for parallel code generation and evaluate it against other code LLMs on the parallel code generation benchmark ParEval.
  • Figure 2: Synthetic data generation process. We collect seed snippets from open source codebases and combine them with multiple prompt templates to create data generation prompts for an LLM. These prompts are then used to generate problem-solution pairs with an LLM.
  • Figure 3: Example synthetic data generation output. Here, a random seed snippet is used alongside the translation prompt template and fed into the LLM. The resulting synthetic sample from the LLM is a problem of translating some code to OpenMP and the corresponding solution.
  • Figure 4: ParEval parallel code generation scores for various prompt formats. Results are shown for 8 total model configurations: {masked, unmasked} gradients $\times$ {instruct, non-instruct} base models $\times$ {1.3B, 6.7B} model sizes. There is no correlation in parallel code generation performance between masked and unmasked gradients; however, fine-tuning the base model rather than the instruct model gives much better results for both the 1.3B and 6.7B models.
  • Figure 5: ParEval MPI code generation performance for increasing amounts of MPI fine-tuning data. As the amount of MPI fine-tuning data increases, the smaller 1.3B model sees an increase in its ability to generate MPI code, with diminishing returns after 6k samples. The larger 6.7B model sees no improvement in MPI code generation performance with additional data.
  • ...and 8 more figures
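
The synthetic data generation process described in the captions for Figures 2 and 3 (seed snippets combined with prompt templates to form data generation prompts) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the template wording, the template names other than "translation", the `build_generation_prompts` helper, and the example seed snippets are hypothetical and are not taken from the paper. The step that sends each prompt to an LLM to obtain a problem-solution pair is deliberately left out.

```python
import random

# Hypothetical prompt templates. The captions mention a "translation"
# template; the wording below and the "generation" template are assumptions.
PROMPT_TEMPLATES = {
    "translation": (
        "Inspired by the code snippet below, write a problem that asks for "
        "code to be translated to {target}, then give a correct solution.\n\n"
        "Snippet:\n{seed}\n"
    ),
    "generation": (
        "Inspired by the code snippet below, write a programming problem that "
        "is solved with {target}, then give a correct solution.\n\n"
        "Snippet:\n{seed}\n"
    ),
}


def build_generation_prompts(seed_snippets, templates, n_prompts,
                             target="OpenMP", rng=None):
    """Pair randomly chosen seed snippets with randomly chosen templates.

    Returns a list of dicts holding the template name, the seed snippet,
    and the filled-in prompt text that would be sent to an LLM.
    """
    rng = rng or random.Random(0)   # fixed seed so the sketch is reproducible
    names = sorted(templates)       # stable ordering for the random choice
    prompts = []
    for _ in range(n_prompts):
        seed = rng.choice(seed_snippets)
        name = rng.choice(names)
        # The seed is passed as a format argument, so braces inside the
        # code snippet are inserted verbatim and not interpreted.
        text = templates[name].format(target=target, seed=seed)
        prompts.append({"template": name, "seed": seed, "prompt": text})
    return prompts


# Hypothetical seed snippets standing in for code collected from
# open-source codebases.
EXAMPLE_SEEDS = [
    "for (int i = 0; i < n; i++) { y[i] += a * x[i]; }",
    "double sum = 0.0; for (int i = 0; i < n; i++) { sum += v[i]; }",
]

if __name__ == "__main__":
    for item in build_generation_prompts(EXAMPLE_SEEDS, PROMPT_TEMPLATES, 3):
        print(item["template"])
        print(item["prompt"])
```

Sampling seeds and templates independently is only one plausible design; pairing every seed with every template would be an equally reasonable reading of the captions.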