Finding Missed Code Size Optimizations in Compilers using LLMs

Davide Italiano; Chris Cummins

TL;DR

This work targets missed code size optimizations in production compilers by combining large language models (LLMs) with differential testing. The authors implement a simple, extensible framework in which an off-the-shelf LLM incrementally mutates a seed program, and four differential testing strategies (dead code, optimization pipeline, single-compiler, and multi-compiler) flag suspicious compilations. The tool is open source and takes fewer than 150 lines of code. To date the authors have reported 24 confirmed bugs in production compilers, spanning C/C++, Rust, and Swift. Porting the approach to a new language requires changing only the target compiler and the initial LLM prompt, which makes LLM-assisted mutation testing a practical way to find optimization bugs in real-world compilers.

Abstract

Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less attention has been paid to ensuring that the compiler produces performant code. In this work we adapt differential testing to the task of identifying missed optimization opportunities in compilers. We develop a novel testing approach which combines large language models (LLMs) with a series of differential testing strategies and use them to find missing code size optimizations in C / C++ compilers. The advantage of our approach is its simplicity. We offload the complex task of generating random code to an off-the-shelf LLM, and use heuristics and analyses to identify anomalous compiler behavior. Our approach requires fewer than 150 lines of code to implement. This simplicity makes it extensible. By simply changing the target compiler and initial LLM prompt we port the approach from C / C++ to Rust and Swift, finding bugs in both. To date we have reported 24 confirmed bugs in production compilers, and conclude that LLM-assisted testing is a promising avenue for detecting optimization bugs in real-world compilers.
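The mutate-then-test loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' tool. The names `search_for_violation`, `mutate`, and `code_size` are hypothetical, the LLM and the compiler are passed in as plain callables, and the check shown (a size-optimized build should not be larger than a speed-optimized one, using `-Os` and `-O3` as example flags) is only one plausible differential test.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces, not taken from the paper's implementation:
#   mutate(source, instruction) -> mutated source (would wrap an LLM call)
#   code_size(source, flag)     -> size in bytes, or None if compilation fails
Mutate = Callable[[str, str], str]
CodeSize = Callable[[str, str], Optional[int]]


def search_for_violation(
    seed: str,
    instructions: Sequence[str],
    mutate: Mutate,
    code_size: CodeSize,
    max_steps: int = 20,
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[str, int, int]]:
    """Mutate `seed` step by step until a size anomaly appears.

    Returns (source, size_at_Os, size_at_O3) on a violation, or None if the
    mutated code stops compiling or the step budget runs out.
    """
    rng = rng or random.Random(0)
    source = seed
    for _ in range(max_steps):
        # Randomly sample one instruction from the predetermined list.
        source = mutate(source, rng.choice(instructions))
        small = code_size(source, "-Os")  # assumed size-oriented flag
        fast = code_size(source, "-O3")   # assumed speed-oriented flag
        if small is None or fast is None:
            return None  # mutated code no longer compiles: stop
        if small > fast:
            return source, small, fast  # suspicious compilation
    return None


# Stand-ins so the sketch runs without an LLM or a compiler.
def fake_mutate(source: str, instruction: str) -> str:
    return source + f"\n// {instruction}"


def fake_code_size(source: str, flag: str) -> Optional[int]:
    lines = source.count("\n") + 1
    # Pretend the size-oriented pipeline regresses once the program grows.
    if flag == "-Os" and lines >= 4:
        return 10 * lines + 5
    return 10 * lines


result = search_for_violation(
    "void f(int x) {}",  # a single empty function with an integer argument
    ["add a loop", "add a branch"],
    fake_mutate,
    fake_code_size,
)
```

In a real setup, `mutate` would send the current program and the sampled instruction to an LLM, and `code_size` would invoke the target compiler and measure the resulting object code. Passing both in as callables keeps the loop independent of any particular model or compiler, which mirrors the portability the abstract describes.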

Paper Structure

This paper contains 38 sections, 16 figures, 3 tables.

Introduction
Methodology
Mutating code using LLMs
Seed program
Mutation prompts
Mutation instructions
Differential testing strategies for discovering missed optimizations
Dead code differential testing
Optimization pipeline differential testing
Single-compiler differential testing
Multi-compiler differential testing
Detecting false positives
Detecting duplicates
Finding bugs in C / C++ compilers
Experimental Setup
...and 23 more sections

Figures (16)

Figure 1: An example of our technique. We instruct an LLM to incrementally mutate a program by randomly sampling a predetermined list of instructions. At each mutation step, an automatic differential testing strategy is used to detect missed optimizations. For this particular example one minute of compute was used and a 36% code size regression was discovered.
Figure 2: Workflow of the automated testing methodology. The system takes two inputs provided by the user: a seed code and a list of mutation instructions (see "Mutating code using LLMs"). Execution iterates until the code mutated by the LLM no longer compiles, or until a series of differential tests and analyses detect a suspicious compilation and trigger a violation (see "Differential testing strategies for discovering missed optimizations").
Figure 3: Seed programs for different programming languages, used as the starting point for mutation. In all three languages the seed code contains a single empty function with an integer argument. From this, the LLM incrementally expands the scope and complexity of the code, directed by our mutation prompts.
Figure 4: Template used to generate LLM prompts.
Figure 5: The addition of the two dead conditionals in (b) exposed a regression in GCC where Value Range Analysis fails to prove that the code is dead.
...and 11 more figures
