Large Language Models for Compiler Optimization

Chris Cummins; Volker Seeker; Dejan Grubisic; Mostafa Elhoushi; Youwei Liang; Baptiste Roziere; Jonas Gehring; Fabian Gloeckle; Kim Hazelwood; Gabriel Synnaeve; Hugh Leather

TL;DR

The paper trains a 7B-parameter Transformer from scratch to optimize LLVM-IR for code size. Given unoptimized IR, the model predicts a compiler pass list; during training it is also asked to predict the instruction counts before and after optimization and the optimized IR itself, and these auxiliary tasks improve its optimization performance. The predicted pass lists reduce instruction counts by 3.0% relative to the compiler baseline, outperforming two state-of-the-art baselines that require thousands of compilations. Without running the compiler, the model generates compilable code 91% of the time and exactly reproduces the compiler's output 70% of the time. The work shows strong code-reasoning abilities in LLMs but also reveals limitations in context length and arithmetic reasoning that motivate future research. Overall, the results suggest that LLMs could augment or replace costly autotuning in compiler optimization, with room for improvement in long-range context handling and verification of semantic equivalence.

Abstract

We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after optimization, and the optimized code itself. These auxiliary learning tasks significantly improve the optimization performance of the model and improve the model's depth of understanding. We evaluate on a large suite of test programs. Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler, outperforming two state-of-the-art baselines that require thousands of compilations. Furthermore, the model shows surprisingly strong code reasoning abilities, generating compilable code 91% of the time and perfectly emulating the output of the compiler 70% of the time.
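To make the headline metric concrete, the following is a minimal sketch of how a percentage "improvement in reducing instruction counts over the compiler" could be computed. It assumes the figure is an aggregate over all test programs (total baseline instructions versus total instructions after applying the model's pass lists); the abstract does not state the exact formula, so this aggregation is an assumption. The function name and the per-program counts are hypothetical and chosen only for illustration.

```python
def improvement_over_baseline(baseline_counts, model_counts):
    """Percent reduction in total instruction count relative to a baseline.

    Assumption: the metric is aggregated over all programs, not averaged
    per program. Both lists hold one instruction count per program.
    """
    if len(baseline_counts) != len(model_counts):
        raise ValueError("need one count per program in both lists")
    total_baseline = sum(baseline_counts)
    if total_baseline == 0:
        raise ValueError("baseline instruction count must be non-zero")
    total_model = sum(model_counts)
    return 100.0 * (total_baseline - total_model) / total_baseline


# Hypothetical counts for three programs (not data from the paper).
baseline = [14, 200, 86]   # instructions after the compiler's default pipeline
predicted = [14, 194, 83]  # instructions after the model's pass lists
print(f"{improvement_over_baseline(baseline, predicted):.1f}%")
```

Under this definition, a program where the model only matches the baseline contributes nothing to the improvement, which is consistent with a small aggregate gain even when most predicted pass lists coincide with the default pipeline.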

Paper Structure (23 sections, 23 figures, 6 tables)

Introduction
Pass Ordering with LLMs
Prompts
LLVM-IR Normalization
The Model
Model Architecture
Training Data
Training
Evaluation
Training Results
Comparison to State-of-the-Art
Evaluation of Generated Pass Lists
Evaluation of Generated Code
Additional Experiments
Ablation of Dataset Size
...and 8 more sections

Figures (23)

Figure 1: Overview of our approach, showing the model input (Prompt) and output (Answer) during training and inference. The prompt contains unoptimized code. The answer contains an optimization pass list, instruction counts, and the optimized code. During inference we generate only the optimization pass list which we feed into the compiler, ensuring that the optimized code is correct.
Figure 2: Performance on holdout validation set during training. We evaluate performance every 250 training steps (131M train tokens). Parity with -Oz is reached at 393M tokens and peak performance at 10.9B tokens.
Figure 3: Frequency that passes occur in the pass list for each of the 100,000 test programs (left), and the length of pass lists (right). -Oz is the starting point for the autotuner and is the dominant result, being the best-found result for 93.2% of autotuned test programs and appearing in an additional 0.6% of pass lists as part of a longer sequence. The model-generated pass distribution tracks the autotuner but slightly overpredicts -Oz (94.3%) and includes 9 passes that the autotuner used on the training set but not on the test set. Results are ordered by decreasing autotuner frequency.
Figure 4: Input code (39 instructions).
Figure 5: Autotuned code (14 instructions) using passes: -reg2mem -instcombine -Os -O1.
...and 18 more figures
