QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Hainan Fang, Yuanbo Wen, Jun Bi, Yihan Wang, Tonghui He, Yanlin Tang, Di Huang, Jiaming Guo, Rui Zhang, Qi Guo, Yunji Chen

TL;DR

This paper tackles the difficulty of building reliable compilers by proposing Neural Compilation: using large language models to translate LLVM IR to assembly. It introduces NeuComBack, a dedicated IR-to-assembly benchmark with two levels (L1 from ExeBench and L2 from TSVC) to evaluate both functional correctness and optimization potential, and defines a foundational Neural Compilation workflow. A key contribution is a self-evolving prompt optimization method that learns from the model's own self-debugging traces to iteratively improve assembly generation. On x86_64, functional correctness rises from 44% to 64%, and 14 of the 16 correctly generated x86_64 programs surpass clang-O3. The results across architectures and datasets suggest that guided prompts can unlock competitive or superior low-level optimizations, highlighting the practical potential of LLM-driven neural compilation while outlining directions for broader benchmarking and robustness improvements.

Abstract

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, the functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64, respectively. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.
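The abstract describes the self-evolving method only at a high level, so the following is a minimal sketch of the general idea (generate, self-debug, distill an insight from the trace, fold it back into the prompt), not the authors' implementation. Every name here (`compile_with_self_debug`, `evolve_prompt`, `pass_rate`, and the `toy_*` stand-ins for the LLM and the test harness) is a hypothetical introduced for illustration; the actual pipeline is the one shown in Figure 1 and may differ in every detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
Generate = Callable[[str, str, Optional[str]], str]  # (prompt, ir, feedback) -> assembly
RunTests = Callable[[str], Tuple[bool, str]]          # assembly -> (passed, error message)


@dataclass
class DebugTrace:
    """Record of one self-debugging episode for a single program."""
    program_id: str
    errors: List[str]  # error messages from failed attempts
    passed: bool       # whether any attempt passed the tests


def compile_with_self_debug(program_id: str, ir: str, prompt: str,
                            generate: Generate, run_tests: RunTests,
                            max_rounds: int = 2) -> DebugTrace:
    """Generate assembly, then retry up to max_rounds times using test feedback."""
    errors: List[str] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds + 1):
        asm = generate(prompt, ir, feedback)
        ok, err = run_tests(asm)
        if ok:
            return DebugTrace(program_id, errors, True)
        errors.append(err)
        feedback = err
    return DebugTrace(program_id, errors, False)


def evolve_prompt(prompt: str, programs: Dict[str, str], generate: Generate,
                  run_tests: RunTests,
                  extract_insight: Callable[[DebugTrace], Optional[str]],
                  epochs: int = 1) -> str:
    """Append insights distilled from debugging traces to the prompt."""
    for _ in range(epochs):
        for pid, ir in programs.items():
            trace = compile_with_self_debug(pid, ir, prompt, generate, run_tests)
            insight = extract_insight(trace)
            if insight and insight not in prompt:
                prompt = prompt + "\n- " + insight
    return prompt


def pass_rate(prompt: str, programs: Dict[str, str],
              generate: Generate, run_tests: RunTests) -> float:
    """Fraction of programs that pass on the first attempt, without debugging."""
    passed = sum(
        compile_with_self_debug(pid, ir, prompt, generate, run_tests, max_rounds=0).passed
        for pid, ir in programs.items()
    )
    return passed / len(programs)


# Toy stand-ins for the LLM and the test harness, so the sketch runs offline.
HINT = "end every function with ret"


def toy_generate(prompt: str, ir: str, feedback: Optional[str]) -> str:
    # The toy "model" only gets it right when hinted or given feedback.
    return "ret" if HINT in prompt or feedback is not None else "nop"


def toy_run_tests(asm: str) -> Tuple[bool, str]:
    return (True, "") if asm == "ret" else (False, "function never returns")


def toy_extract_insight(trace: DebugTrace) -> Optional[str]:
    # Learn only from episodes that failed first and were then repaired.
    return HINT if trace.errors and trace.passed else None


PROGRAMS = {
    "f1": "define void @f1() { ret void }",
    "f2": "define void @f2() { ret void }",
}
BASE = "Translate the LLVM IR to x86_64 assembly."
EVOLVED = evolve_prompt(BASE, PROGRAMS, toy_generate, toy_run_tests, toy_extract_insight)
```

In this toy setup the base prompt fails every program on the first attempt, while the evolved prompt carries the lesson from the first repaired failure and passes both. A real system would replace the `toy_*` functions with an LLM call and a compile-and-execute harness, and would likely need a more careful policy for which insights to keep.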

Paper Structure

This paper contains 32 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Pipeline of our automatic prompt learning method on Neural Compilation.
  • Figure 2: Original ExeBench C code statistics. This figure illustrates the distribution of C code lines in the programs from the original ExeBench dataset (Test Set).
  • Figure 3: A case of performance self-optimization (s452)
  • Figure 4: A case of performance better than O3 (s332)