Agentic Harness for Real-World Compilers

Yingwei Zheng; Cong Li; Shaohua Li; Yuqun Zhang; Zhendong Su

Abstract

Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross-domain expertise requirements, and sparse, non-descriptive bug reports, necessitating compiler-specific tools. To bridge the gap, we introduce llvm-autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm-autofix are agent-friendly LLVM tools, a benchmark llvm-bench of reproducible LLVM bugs, and a tailored minimal agent llvm-autofix-mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm-autofix-mini also outperforms the state-of-the-art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: https://github.com/dtcxzyw/llvm-autofix

Paper Structure (30 sections, 8 figures, 8 tables)

Introduction
The llvm-autofix Harness
Harness Tooling
The llvm-bench Benchmark
The llvm-autofix-mini Agent
Experiment Setup
Experiment Results
Benchmark and Model Performance
Baseline Comparison and Common Failures
Genuine Performance via Expert Review
Discussion
LLMs for Compilers: Open Challenges
Related Work
Conclusion
llvm-bench: More Details
...and 15 more sections

Figures (8)

Figure 1: Compiler issues are challenging to diagnose and repair in the absence of descriptive information. The figure compares LLVM issues (a crash and a miscompilation) with a common software issue from Django; all three issues are simplified for brevity.
Figure 2: Distribution of Affected Components in llvm-bench
Figure 3: As splits increase in difficulty, frontier models tend to struggle or fail. G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively.
Figure 4: Failure distribution of unresolved issues. G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively.
Figure 5: As splits increase in difficulty, frontier models' genuine capability degrades quickly: no model can handle the hard split except GPT 5 (llvm-autofix-mini). G4, G5, GM, QW, and DS are short for GPT 4o, GPT 5, Gemini 2.5 Pro, Qwen 3 Max, and DeepSeek V3.2, respectively. Blue bullets ($\cdot$) represent the rate of genuinely resolved issues, the same as "% Resolved" in the human-study results table.
...and 3 more figures
