Towards LLM-based optimization compilers. Can LLMs learn how to apply a single peephole optimization? Reasoning is all LLMs need!

Xiangxin Fang, Lev Mukhanov

TL;DR

This study investigates whether large language models can learn and apply a single peephole optimization to AArch64 assembly code, using a fine-tuned 7B-parameter Llama2 model as the baseline and comparing it with OpenAI GPT-4o and GPT-o1 (preview). The authors generate training data from LLVM-based peephole transformations and evaluate model outputs with BLEU, EMR (Exact Match Rate), Syntactic, and IO metrics, showing that surface accuracy does not guarantee a correct optimization. A key finding is that chain-of-thought reasoning, as implemented in GPT-o1, substantially improves optimization performance: GPT-o1 outperforms both GPT-4o and the fine-tuned Llama2 on selected tests. The results suggest that enhanced reasoning mechanisms are crucial for reliable code optimization by LLMs, with implications for future research on AI-assisted compilers and code generation.

Abstract

Large Language Models (LLMs) have demonstrated great potential in various language processing tasks, and recent studies have explored their application to compiler optimizations. However, all of these studies focus on conventional open-source LLMs, such as Llama2, which lack enhanced reasoning mechanisms. In this study, we investigate the errors produced by a fine-tuned 7B-parameter Llama2 model as it attempts to learn and apply a simple peephole optimization to AArch64 assembly code. We analyze these errors and compare the model with state-of-the-art OpenAI models that implement advanced reasoning logic, including GPT-4o and GPT-o1 (preview). We demonstrate that OpenAI GPT-o1, despite not being fine-tuned, outperforms both the fine-tuned Llama2 and GPT-4o. Our findings indicate that this advantage is largely due to the chain-of-thought reasoning implemented in GPT-o1. We hope our work will inspire further research on using LLMs with enhanced reasoning mechanisms and chain-of-thought for code generation and optimization.

Paper Structure

This paper contains 16 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Basic block generation pipeline.
  • Figure 2: The code optimization pipeline.
  • Figure 3: Change in target LLM accuracy metrics with the number of tokens used for fine-tuning.
  • Figure 4: Model performance on different test sets.
  • Figure 5: Correlation between EMR (Exact Match Rate) and the number of prompt shots.
  • ...and 3 more figures