Can Large Language Models Write Parallel Code?
Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele
TL;DR
The paper introduces ParEval, a comprehensive benchmark designed to evaluate large language models on their ability to generate and translate parallel code across multiple problem types and execution models. It defines two novel metrics, speedup$_n$@$k$ and efficiency$_n$@$k$, to quantify runtime performance and scaling of generated code, alongside the standard pass@$k$ correctness metric. Through extensive experiments with open- and closed-source LLMs, the study finds that all models struggle to produce correct and performant parallel code, with MPI being the most challenging and OpenMP/Kokkos the most tractable among parallel models. Translation between execution models helps correctness substantially, especially for smaller models, but does not reliably improve performance or scalability. The work highlights the need for specialized parallel-code models and provides a publicly available benchmark to guide future improvements in parallel code generation for HPC tasks.
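To make the metrics named above concrete, here is a minimal Python sketch. `pass_at_k` uses the standard unbiased pass@$k$ estimator from the code-generation evaluation literature. The summary does not give formulas for speedup$_n$@$k$ and efficiency$_n$@$k$, so `expected_best_of_k` and `efficiency_at_k` are assumptions: they read speedup$_n$@$k$ as the expected best speedup among $k$ samples drawn from the generated set, and efficiency$_n$@$k$ as that value divided by the number of processing resources $n$. The function names, and the convention that an incorrect sample counts as speedup 0, are hypothetical choices for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt.

    n: samples generated, c: samples that are correct, k: samples drawn.
    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated, is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected_best_of_k(speedups: list[float], k: int) -> float:
    """Expected maximum of k values drawn without replacement.

    ASSUMPTION: this is one reading of speedup_n@k for a single prompt,
    with incorrect samples entered as speedup 0.0.
    """
    xs = sorted(speedups)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= len(speedups)")
    # The value at sorted index j is the maximum of exactly comb(j, k - 1)
    # of the comb(n, k) possible k-subsets.
    return sum(comb(j, k - 1) * x for j, x in enumerate(xs)) / comb(n, k)


def efficiency_at_k(speedups: list[float], k: int, n_resources: int) -> float:
    """ASSUMPTION: efficiency is the expected best speedup per resource."""
    return expected_best_of_k(speedups, k) / n_resources
```

For example, with 20 generated samples of which 5 are correct, `pass_at_k(20, 5, 1)` is 0.25, the plain fraction of correct samples. A benchmark-level score would then average these per-prompt values over all prompts.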
Abstract
Large language models are becoming an increasingly popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. To evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.
