Compiler generated feedback for Large Language Models

Dejan Grubisic; Chris Cummins; Volker Seeker; Hugh Leather

TL;DR

This work introduces a compiler-generated feedback loop in which a Large Language Model optimizes LLVM IR by predicting the best optimization passes, the target instruction counts, and the optimized IR; the compiler then validates these predictions and returns feedback the model can use to refine them. Three feedback forms (Short, Long, Fast) are explored, using a 7B-parameter Llama 2-based model trained for 20,000 steps on 64 GPUs. In single-shot settings, the feedback-enhanced approaches improve on the original model's gain over the -Oz baseline by up to 0.53%. Plain sampling from the original model, however, reaches up to 98% of autotuner performance with 100 samples, and iterative feedback does not consistently beat sampling. The study demonstrates both the viability and the limitations of integrating LLMs with compiler optimization, highlights sampling as a particularly potent tool, and outlines future directions: smarter feedback and training data derived from feedback-driven prompts.

Abstract

We introduce a novel paradigm in compiler optimization powered by Large Language Models with compiler feedback to optimize the code size of LLVM assembly. The model takes unoptimized LLVM IR as input and produces optimized IR, the best optimization passes, and the instruction counts of both the unoptimized and optimized IRs. We then compile the input with the generated optimization passes and evaluate whether the predicted instruction count is correct, whether the generated IR is compilable, and whether it corresponds to the compiled code. We provide this feedback to the LLM and give it another chance to optimize the code. This approach adds an extra 0.53% improvement over -Oz to the original model. Even though adding more information with feedback seems intuitive, simple sampling techniques achieve much higher performance given 10 or more samples.
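
The checks described in the abstract can be sketched as a small function that compares what the model predicted with what the compiler reports. This is a minimal illustration only, not the authors' implementation: the `Generation` and `CompileResult` types, their field names, the message wording, and the `bleu_threshold` parameter are all hypothetical, and the compiler results are assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Generation:
    """Model output (hypothetical field names)."""
    pass_list: List[str]
    predicted_input_count: int
    predicted_output_count: int
    generated_ir: str


@dataclass
class CompileResult:
    """Compiler-side measurements (hypothetical; computed elsewhere)."""
    pass_list_valid: bool
    input_count: int
    output_count: int
    generated_ir_compiles: bool
    bleu_generated_vs_compiled: float  # similarity in [0, 1]


def build_feedback(gen: Generation, res: CompileResult,
                   bleu_threshold: float = 1.0) -> List[str]:
    """Return one message per problem found; an empty list means no retry is needed."""
    messages = []
    if not res.pass_list_valid:
        messages.append("Generated pass list is invalid.")
    if gen.predicted_input_count != res.input_count:
        messages.append(
            f"Input instruction count is {res.input_count}, "
            f"predicted {gen.predicted_input_count}.")
    if gen.predicted_output_count != res.output_count:
        messages.append(
            f"Optimized instruction count is {res.output_count}, "
            f"predicted {gen.predicted_output_count}.")
    if not res.generated_ir_compiles:
        messages.append("Generated IR does not compile.")
    elif res.bleu_generated_vs_compiled < bleu_threshold:
        # Assumed rule: any similarity below the threshold counts as a mismatch.
        messages.append("Generated IR differs from the compiled IR.")
    return messages
```

In this sketch, a non-empty result is what would trigger a second attempt with the feedback appended to the prompt.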

Paper Structure (14 sections, 7 figures)

Introduction
Feedback-directed LLMs
Feedback Metrics vs. Model Performance
The Model
Datasets
Training
Evaluation
How does the feedback model compare to the original in Task Optimize and Task Feedback?
Feedback model sampling
Additional Experiments
Feedback model iterative algorithm
Related Work
Limitations and Future Work
Conclusions

Figures (7)

Figure 1: Feedback-directed model. First, the prompt asks the LLM to optimize the instruction count of the given IR; the LLM then generates the best optimization passes, the instruction counts for the starting and generated IR, and the generated IR itself. Next, we compile the input with the generated pass list and create feedback by checking whether the generated pass list is valid, evaluating the instruction counts, examining whether the generated IR contains compilation errors, and calculating the BLEU score between the generated IR and the compiled IR. If any of the feedback metrics is problematic, we extend the original prompt with the generation, the compiled code, and the feedback, and ask the LLM to try again.
Figure 2: Prompt structure of Feedback models. Short Feedback is the smallest in size and extends the prompt with just the calculated metrics and error messages. Long Feedback contains the most information, including the compiled IR. Fast Feedback is the fastest to generate, since it can be calculated without generating the IR.
Figure 3: Correlation heatmap of metrics available at inference time. Input and output prompts are described with prefixes (src, tgt). Instruction counts are abbreviated with inst_count. (G) stands for generation while (C) stands for compiled.
Figure 4: Distribution of absolute error in predicting the optimized IR instruction count and BLEU score with respect to performance compared to the autotuner.
Figure 5: Comparison of the original and feedback models in reducing instruction count. The upper figure shows performance on Task Optimize. The lower figure shows performance on Task Feedback, where each model uses its own feedback format. Horizontally, we show performance on all examples, examples where the autotuner's best pass is non-Oz, examples where the original model was worse than the autotuner, and examples where the original model mispredicted the target instruction count. All models retain the ability to perform Task Optimize while improving when feedback is provided.
...and 2 more figures
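
The three prompt forms in the Figure 2 caption can be illustrated with a short sketch. The captions do not give the exact layout, so everything concrete here is an assumption: the function name `extend_prompt`, the section labels, the metric keys, and the choice of which metrics count as depending on the generated IR are hypothetical and only mirror the relative sizes the caption describes.

```python
from typing import Dict, Optional, Sequence

# Assumption: these metrics can only be computed once the IR has been generated.
IR_DEPENDENT_METRICS = {"generated_ir_compiles", "bleu"}


def extend_prompt(form: str,
                  prompt: str,
                  metrics: Dict[str, object],
                  errors: Sequence[str] = (),
                  generation: Optional[str] = None,
                  compiled_ir: Optional[str] = None) -> str:
    """Build a retry prompt in a hypothetical short, long, or fast layout."""
    if form not in ("short", "long", "fast"):
        raise ValueError(f"unknown feedback form: {form}")
    parts = [prompt]
    if form == "long":
        # Long: the most information, including the compiled IR.
        if generation is None or compiled_ir is None:
            raise ValueError("long feedback needs generation and compiled_ir")
        parts += ["[generation]", generation, "[compiled IR]", compiled_ir]
    if form == "fast":
        # Fast: keep only what is available without generating the IR.
        metrics = {k: v for k, v in metrics.items()
                   if k not in IR_DEPENDENT_METRICS}
        errors = ()
    parts.append("[feedback]")
    parts += [f"{name}: {value}" for name, value in metrics.items()]
    parts += list(errors)
    return "\n".join(parts)
```

Under these assumptions the short form carries only metrics and error messages, the long form adds the generation and the compiled IR, and the fast form drops anything that would require the generated IR.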
