Do Large Language Models Understand Performance Optimization?

Bowen Cui, Tejas Ramesh, Oscar Hernandez, Keren Zhou

TL;DR

This work examines whether large language models (LLMs) can meaningfully optimize HPC code by comparing them against a traditional performance tool, Codee, on a multi-motif benchmark. It introduces a 26-kernel benchmark spanning 11 HPC motifs and a performance-optimization agent that couples LLMs with a static/dynamic analysis workflow, evaluated on CPU architectures with GCC and CLANG. Across speedup, correctness, and HPC commonsense, the study finds that compiler-based optimization via Codee delivers the most reliable performance and correctness, while LLMs show potential for adaptive, algorithmic improvements but frequently produce incorrect results or poor parallelizations. The results argue for hybrid systems that integrate LLM reasoning with traditional HPC analysis, combining the strengths of both paradigms and guiding future hardware-aware optimization research.

Abstract

Large Language Models (LLMs) have emerged as powerful tools for software development tasks such as code completion, translation, and optimization. However, their ability to generate efficient and correct code, particularly in complex High-Performance Computing (HPC) contexts, has remained underexplored. To address this gap, this paper presents a comprehensive benchmark suite encompassing multiple critical HPC computational motifs to evaluate the performance of code optimized by state-of-the-art LLMs, including OpenAI o1, Claude-3.5, and Llama-3.2. In addition to analyzing basic computational kernels, we developed an agent system that integrates LLMs to assess their effectiveness in real HPC applications. Our evaluation focused on key criteria such as execution time, correctness, and understanding of HPC-specific concepts. We also compared the results with those achieved using traditional HPC optimization tools. Based on the findings, we recognized the strengths of LLMs in understanding human instructions and performing automated code transformations. However, we also identified significant limitations, including their tendency to generate incorrect code and their challenges in comprehending complex control and data flows in sophisticated HPC code.
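To make two of the evaluation criteria named above concrete (execution time and correctness), the following is a minimal sketch of how a speedup and an output-equivalence check could be computed for one kernel. It is an illustration only, not the paper's harness: the function names `speedup`, `outputs_match`, and `evaluate`, as well as the tolerance values, are assumptions introduced here.

```python
import math
from typing import Dict, Sequence


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Return baseline time divided by optimized time.

    A value above 1.0 means the optimized code ran faster;
    a value below 1.0 means it ran slower.
    """
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / optimized_seconds


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-6,  # hypothetical tolerance, not taken from the paper
    abs_tol: float = 1e-9,  # hypothetical tolerance, not taken from the paper
) -> bool:
    """Compare two numeric outputs element by element within a tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def evaluate(
    baseline_seconds: float,
    optimized_seconds: float,
    reference: Sequence[float],
    candidate: Sequence[float],
) -> Dict[str, object]:
    """Bundle both criteria; a speedup only counts if the output is correct."""
    correct = outputs_match(reference, candidate)
    return {
        "correct": correct,
        "speedup": speedup(baseline_seconds, optimized_seconds) if correct else None,
    }
```

Reporting no speedup for an incorrect result reflects the point made in the abstract that a fast but wrong transformation is not a successful optimization; whether the paper's own scripts treat failures this way is not stated in this section.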

Paper Structure

This paper contains 37 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: An overview of our evaluation framework.
  • Figure 2: An example of Codee suggesting optimizations for MATMUL.
  • Figure 3: Speedups of single-round and multiple-round serial optimizations. The Y axis indicates the speedup, and the horizontal lines denote a speedup of 1.0. Benchmarks that achieved a speedup >= 5.9 in multiple-round optimization across the GCC/G++ and CLANG/CLANG++ compilers are marked above the respective bars.
  • Figure 4: Results from parallel optimizations using the GCC/G++ compiler. The Y axis indicates the speedup.
  • Figure 5: Results from parallel optimizations using the CLANG/CLANG++ compiler. The Y axis indicates the speedup.
  • ...and 6 more figures