AutoParLLM: GNN-guided Context Generation for Zero-Shot Code Parallelization using LLMs

Quazi Ishtiaque Mahmud; Ali TehraniJamsaz; Hung Phan; Le Chen; Mihai Capotă; Theodore Willke; Nesreen K. Ahmed; Ali Jannesari

TL;DR

AutoParLLM introduces a GNN-guided context-generation framework to enhance zero-shot OpenMP code parallelization via large language models. By training GNNs on PerfoGraph representations to predict parallelism and patterns, AutoParLLM produces context-rich prompts that guide LLMs to generate correct and efficient parallel code, quantified by a novel OMPScore metric. Evaluations on NAS and Rodinia show substantial improvements in CodeBERTScore and directive quality, along with notable speedups and enhanced developer productivity; OpenACC extension results demonstrate adaptability to other parallel models. The work advances scaffolding for LLM-assisted HPC code generation by tightly integrating graph-based program analysis with prompt engineering, yielding practical benefits for parallelization tasks.

Abstract

In-Context Learning (ICL) has been shown to be a powerful technique to augment the capabilities of LLMs for a diverse range of tasks. This work proposes AutoParLLM, a novel way to generate context using guidance from graph neural networks (GNNs) to generate efficient parallel codes. We evaluate AutoParLLM on 12 applications from two well-known benchmark suites of parallel codes: NAS Parallel Benchmark and Rodinia Benchmark. Our results show that AutoParLLM improves the state-of-the-art LLMs (e.g., GPT-4) by 19.9% in NAS and 6.48% in Rodinia benchmark in terms of CodeBERTScore for the task of parallel code generation. Moreover, AutoParLLM improves the ability of the most powerful LLM to date, GPT-4, by achieving ≈17% (on NAS benchmark) and ≈16% (on Rodinia benchmark) better speedup. In addition, we propose OMPScore for evaluating the quality of the parallel code and show its effectiveness in evaluating parallel codes. AutoParLLM is available at https://github.com/quazirafi/AutoParLLM.git.
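The idea of turning GNN predictions into LLM context can be illustrated with a minimal sketch. It is not the authors' implementation: the `GnnPrediction` container, the `build_prompt` helper, the pattern label "reduction", and the prompt wording are all hypothetical names chosen for illustration. The sketch only assumes, as the summary above states, that a GNN predicts whether a loop is parallelizable and which pattern it follows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GnnPrediction:
    """Hypothetical container for the two GNN outputs (parallelism, pattern)."""
    is_parallel: bool
    pattern: Optional[str] = None  # e.g. "reduction"; the label set is an assumption


def build_prompt(loop_source: str, pred: GnnPrediction) -> Optional[str]:
    """Build an LLM prompt from a prediction, or return None for serial loops."""
    if not pred.is_parallel:
        # Predicted non-parallel: no prompt is generated in this sketch.
        return None
    hint = "The loop can be parallelized."
    if pred.pattern is not None:
        hint += f" It exhibits a {pred.pattern} pattern."
    # The prompt wording below is illustrative, not taken from the paper.
    return (
        f"{hint}\n"
        "Generate the OpenMP version of the following loop:\n"
        f"{loop_source}"
    )
```

For example, `build_prompt("for (i = 0; i < n; i++) sum += a[i];", GnnPrediction(True, "reduction"))` yields a prompt whose first line carries the predicted pattern as context, while a loop predicted as non-parallel yields `None`.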

Paper Structure (49 sections, 1 equation, 13 figures, 12 tables)

Introduction
Background
Approach
Training
Data Collection and Preprocessing
Program Representation
Graph Neural Network (GNN) Training
Inference
Prompt Engineering
OMPScore
Experimental Results
Experimental Setup
Parallelism Detection Module
Pattern Detection Module
GNN Classifier
...and 34 more sections

Figures (13)

Figure 1: Effect of AutoParLLM. ALLM = AutoParLLM applied (green bars). Average speedup (%) gain of GPT-4 is improved by 17.7% (Intel) and 17.2% (AMD) on NAS and by 16.1% (Intel) and 19.5% (AMD) on Rodinia. LLMs are prompted with few-shot settings, and speedups are reported using 4 threads. (Comparison with more LLMs is given in the appendix.)
Figure 2: Overview of the AutoParLLM workflow.
Figure 3: Overview of OMPScore.
Figure 4: Speedup gain across individual applications in the NAS Parallel Benchmark. ALLM-GPT-4 achieves at most 24.7% and 28.6% better speedup than GPT-4 for CG on Intel and AMD CPUs, respectively.
Figure 5: Speedup gain across individual applications in the Rodinia-3.1 Benchmark. ALLM-GPT-4 achieves at most 40.6% and 30.2% better speedup than GPT-4 for Heartwall on Intel and AMD CPUs, respectively.
...and 8 more figures
