Compiler Testing With Relaxed Memory Models
Luke Geeson, Lee Smith
TL;DR
This work targets compiler correctness under relaxed memory models for concurrent C/C++ programs by introducing Téléchat, an automated, model-based testing tool. Téléchat uses the herd simulator to compare outcomes of source programs under source memory models with compiled programs under architecture models, detecting bugs when $outcomes(exec(comp(S),M_{C})) \not\subseteq outcomes(exec(S),M_{S})$. The approach is parameterised over multiple memory models, scalable through compilation-time optimisations, and designed for automated regression testing in industry—demonstrated by deployment within Arm and large-scale differential testing across LLVM and GCC. The evaluation shows Téléchat finds behaviours missed by prior work, reports new concurrency bugs and model issues, and highlights practical industry impact while also outlining limitations and future directions such as model completeness and state-explosion challenges.
Abstract
Finding bugs is key to the correctness of compilers in wide use today. If the behaviour of a compiled program, as allowed by its architecture memory model, is not a behaviour of the source program under its source model, then there is a bug. This holds for all programs, but we focus on concurrency bugs that occur only with two or more threads of execution. We focus on testing techniques that detect such bugs in C/C++ compilers. We seek a testing technique that automatically covers concurrency bugs up to fixed bounds on program sizes and that scales to find bugs in compiled programs with many lines of code. Otherwise, a testing technique can miss bugs. Unfortunately, the state-of-the-art techniques are yet to satisfy all of these properties. We present the Téléchat compiler testing tool for concurrent programs. Téléchat compiles a concurrent C/C++ program and compares source and compiled program behaviours using source and architecture memory models. We make three claims: Téléchat improves the state-of-the-art at finding bugs in code generation for multi-threaded execution, it is the first public description of a compiler testing tool for concurrency that is deployed in industry, and it is the first tool that takes a significant step towards the desired properties. We provide experimental evidence suggesting Téléchat finds bugs missed by other state-of-the-art techniques, case studies indicating that Téléchat satisfies the properties, and reports of our experience deploying Téléchat in industry regression testing.
