Can Large Language Models Write Parallel Code?

Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele

TL;DR

The paper introduces ParEval, a comprehensive benchmark designed to evaluate large language models on their ability to generate and translate parallel code across multiple problem types and execution models. It defines two novel metrics, speedup$_n$@$k$ and efficiency$_n$@$k$, to quantify runtime performance and scaling of generated code, alongside the standard pass@$k$ correctness metric. Through extensive experiments with open- and closed-source LLMs, the study finds that all models struggle to produce correct and performant parallel code, with MPI being the most challenging and OpenMP/Kokkos the most tractable among parallel models. Translation between execution models helps correctness substantially, especially for smaller models, but does not reliably improve performance or scalability. The work highlights the need for specialized parallel-code models and provides a publicly available benchmark to guide future improvements in parallel code generation for HPC tasks.
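
For readers unfamiliar with the pass@$k$ metric mentioned above, the following is a minimal sketch of the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are generated per prompt and $c$ of them are correct. This summary does not spell out the exact evaluation procedure, so the function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are illustrative assumptions, not the authors' code. The speedup$_n$@$k$ and efficiency$_n$@$k$ metrics are not sketched here because their definitions are not given in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of samples generated for a prompt
    c: number of those samples that passed the correctness tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # necessarily contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over prompts; results holds one (n, c) pair per prompt.

    The (n, c) layout is a hypothetical input format chosen for this sketch.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a prompt with 20 samples of which 5 are correct gives `pass_at_k(20, 5, 1) == 0.25`, which is simply the fraction of correct samples; larger values of $k$ raise the score because only one of the $k$ samples needs to pass.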

Abstract

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.

Paper Structure

This paper contains 26 sections, 5 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Each LLM's pass@1 score over ParEval. All of the LLMs score significantly worse in generating parallel code than serial code.
  • Figure 2: The pass@k for various values of k. The relative order of the LLMs is the same for all values of k with Phind-V2 leading the group.
  • Figure 3: pass@1 for each execution model. The LLMs generally follow the same distribution of scores across the execution models: serial (best), OpenMP, CUDA/HIP, and MPI/MPI+OpenMP (worst) with Kokkos varying between LLMs.
  • Figure 4: pass@1 for each problem type. The LLMs are best at transform problems, while they are worst at sparse linear algebra problems.
  • Figure 5: pass@1 for GPT-4 across all execution models and problem types. GPT-4 excels with the Kokkos and OpenMP execution models, while getting more problems correct for transform, search, and reduce problems.
  • ...and 6 more figures