Agentic Harness for Real-World Compilers

Yingwei Zheng, Cong Li, Shaohua Li, Yuqun Zhang, Zhendong Su

Abstract

Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross-domain expertise requirements, and sparse, non-descriptive bug reports, necessitating compiler-specific tools. To bridge the gap, we introduce llvm-autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm-autofix are agent-friendly LLVM tools, a benchmark llvm-bench of reproducible LLVM bugs, and a tailored minimal agent llvm-autofix-mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm-autofix-mini also outperforms the state-of-the-art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: https://github.com/dtcxzyw/llvm-autofix

Paper Structure (30 sections, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Compiler issues are challenging to diagnose and repair, in the absence of descriptive information. This is a comparison between LLVM issues (crash and miscompilation) and common software issues from Django; all three issues are simplified for brevity.
  • Figure 2: Distribution of affected components in llvm-bench.
  • Figure 3: As splits increase in difficulty, frontier models tend to struggle or fail. G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively.
  • Figure 4: Failure distribution of unresolved issues. G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively.
  • Figure 5: As splits increase in difficulty, frontier models' genuine capability degrades quickly: no model can handle the hard split except GPT 5 (llvm-autofix-mini). G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively. Blue bullets ($\cdot$) represent the rate of genuinely resolved issues, the same as "% Resolved" in the human-study results table.
  • ...and 3 more figures