Do Large Language Models Understand Performance Optimization?
Bowen Cui, Tejas Ramesh, Oscar Hernandez, Keren Zhou
TL;DR
This work examines whether large language models can meaningfully optimize HPC code by comparing them against a traditional performance tool on a multi-motif benchmark. It introduces a benchmark of 26 kernels spanning 11 HPC motifs and a performance-optimization agent that couples LLMs with a static/dynamic analysis workflow, evaluated on CPU architectures with GCC and Clang. Across speedup, correctness, and HPC commonsense, the study finds that compiler-based optimization via Codee delivers the most reliable performance and correctness, while LLMs show potential for adaptive, algorithmic improvements but frequently produce incorrect results or poor parallelizations. The results argue for hybrid systems that integrate LLM reasoning with traditional HPC analysis to combine the strengths of both paradigms and to guide future hardware-aware optimization research.
Abstract
Large Language Models (LLMs) have emerged as powerful tools for software development tasks such as code completion, translation, and optimization. However, their ability to generate efficient and correct code, particularly in complex High-Performance Computing (HPC) contexts, remains underexplored. To address this gap, this paper presents a comprehensive benchmark suite covering multiple critical HPC computational motifs to evaluate the performance of code optimized by state-of-the-art LLMs, including OpenAI o1, Claude-3.5, and Llama-3.2. Beyond analyzing basic computational kernels, we develop an agent system that integrates LLMs to assess their effectiveness in real HPC applications. Our evaluation focuses on three criteria: execution time, correctness, and understanding of HPC-specific concepts. We also compare the results with those achieved using traditional HPC optimization tools. Our findings highlight the strengths of LLMs in understanding human instructions and performing automated code transformations. However, they also reveal significant limitations, including a tendency to generate incorrect code and difficulty comprehending the complex control and data flows of sophisticated HPC code.
