QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code
Hainan Fang, Yuanbo Wen, Jun Bi, Yihan Wang, Tonghui He, Yanlin Tang, Di Huang, Jiaming Guo, Rui Zhang, Qi Guo, Yunji Chen
TL;DR
This paper tackles the difficulty of building reliable compilers by proposing Neural Compilation as a pathway to translate LLVM IR to assembly using large language models. It introduces NeuComBack, a dedicated IR-to-assembly benchmark with two levels (L1 from ExeBench and L2 from TSVC) to evaluate both functional correctness and optimization potential, and defines a foundational Neural Compilation workflow. A key contribution is a self-evolving prompt optimization method that learns from the model's own self-debugging traces to iteratively improve assembly generation, yielding substantial gains in correctness and performance: on x86_64, correctness rises from 44% to 64%, and 14 of the 16 correctly generated programs (87.5%) surpass clang-O3. The results across architectures and datasets suggest that guided prompts can unlock competitive or superior low-level optimizations, highlighting the practical potential of LLM-driven neural compilation while outlining directions for broader benchmarking and robustness improvements.
Abstract
Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. Recent advances in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. First, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Second, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. To address these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared with baseline prompts, functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.
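The self-evolving loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. Every name below (`Trace`, `compile_with_self_debug`, `evolve_prompt`, and the `toy_*` stand-ins that replace real LLM calls and a real assembler/test harness) is hypothetical, and the choice to extract insights only from traces that failed first and were then repaired is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One self-debugging episode (hypothetical record format)."""
    ir: str
    attempts: List[Tuple[str, str]]  # (assembly, error); error is "" on success
    solved: bool


def compile_with_self_debug(ir: str, prompt: str,
                            generate: Callable[[str, str, str], str],
                            check: Callable[[str, str], str],
                            max_rounds: int = 3) -> Trace:
    """Generate assembly, then retry with the checker's error as feedback."""
    attempts: List[Tuple[str, str]] = []
    feedback = ""
    for _ in range(max_rounds):
        asm = generate(prompt, ir, feedback)
        error = check(ir, asm)
        attempts.append((asm, error))
        if not error:
            return Trace(ir, attempts, True)
        feedback = error
    return Trace(ir, attempts, False)


def evolve_prompt(prompt: str, traces: List[Trace],
                  extract_insight: Callable[[Trace], str]) -> str:
    """Append insights distilled from repaired traces to the prompt.

    Assumption: only traces that failed at least once and were then fixed
    are treated as informative.
    """
    insights: List[str] = []
    for trace in traces:
        if trace.solved and len(trace.attempts) > 1:
            insight = extract_insight(trace)
            if insight and insight not in insights and insight not in prompt:
                insights.append(insight)
    if not insights:
        return prompt
    return prompt + "\n" + "\n".join(f"- {i}" for i in insights)


# Toy stand-ins so the sketch runs without a model or an assembler.
def toy_generate(prompt: str, ir: str, feedback: str) -> str:
    # Emits the final "ret" only when the prompt or the feedback mentions it.
    if "ret" in prompt or "ret" in feedback:
        return "mov eax, 0\nret"
    return "mov eax, 0"


def toy_check(ir: str, asm: str) -> str:
    # Returns an error message, or "" when the toy correctness check passes.
    return "" if asm.strip().endswith("ret") else "missing ret at end of function"


def toy_extract(trace: Trace) -> str:
    # A real system would ask the LLM to summarize the trace.
    return "Always end a function with ret."


if __name__ == "__main__":
    base = "Translate LLVM IR to x86_64 assembly."
    ir = "define i32 @f() { ret i32 0 }"
    first = compile_with_self_debug(ir, base, toy_generate, toy_check)
    evolved = evolve_prompt(base, [first], toy_extract)
    second = compile_with_self_debug(ir, evolved, toy_generate, toy_check)
    print(len(first.attempts), len(second.attempts))
```

In this toy setup the evolved prompt lets the second run succeed without a debugging round. That is the effect the method aims for, although real gains depend on the model and on the quality of the extracted insights.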
