ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
Lars Gröninger, Beatriz Souza, Michael Pradel
TL;DR
ChangeGuard addresses the problem of automatically checking whether a code change preserves runtime behavior, which is difficult when the change sits deep inside a large project. It introduces pairwise learning-guided execution: the old and new versions of the changed function are merged into a single comparison program, which executes both versions side by side while injecting diverse, project-specific values to explore execution paths. The approach improves the robustness and coverage of prior learning-guided execution, reaching a median code-path coverage of about 92%. On manually annotated changes, it identifies semantics-changing changes with a precision of 77.1% and a recall of 69.5%, and it detects many that the projects' regression tests miss. Evaluations on manually annotated commits and on automated refactorings (including GPT-generated changes) suggest that ChangeGuard can flag unintended behavior changes early and serve as a practical validation step for automated code transformations.
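The core idea of running two versions side by side and comparing what they do can be illustrated with a minimal sketch. This is not ChangeGuard's implementation: the functions `old_version` and `new_version`, the helpers `observe` and `compare_versions`, and the hand-written `CANDIDATE_INPUTS` are all hypothetical, and the fixed input list is only a stand-in for the learned, project-specific values that the approach injects.

```python
def old_version(items, sep=","):
    # Hypothetical "before" function: joins all items as strings.
    return sep.join(str(i) for i in items)


def new_version(items, sep=","):
    # Hypothetical "after" function: a cleanup that silently drops falsy items.
    return sep.join(str(i) for i in items if i)


def observe(func, args):
    """Run func and capture its observable outcome: a return value or an exception type."""
    try:
        return ("return", func(*args))
    except Exception as exc:
        return ("raise", type(exc).__name__)


def compare_versions(old, new, candidate_inputs):
    """Return every input on which the two versions behave differently."""
    differences = []
    for args in candidate_inputs:
        before = observe(old, args)
        after = observe(new, args)
        if before != after:
            differences.append((args, before, after))
    return differences


# Assumption: a fixed, hand-written pool of argument tuples replaces
# the values that a learning-guided executor would inject.
CANDIDATE_INPUTS = [
    (["a", "b"],),
    ([1, 2, 3],),
    ([0, 1],),
    ([],),
    (None,),
]

diffs = compare_versions(old_version, new_version, CANDIDATE_INPUTS)
for args, before, after in diffs:
    print(f"behavior differs for {args!r}: {before} vs. {after}")
```

In this toy setup only the input containing `0` exposes the difference, which hints at why diverse injected values matter: a change can look behavior-preserving on most inputs and still alter semantics on a few.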
Abstract
Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a function before and after it is modified. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations.
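For readers less familiar with the reported metrics, the following sketch shows how precision and recall are conventionally computed for a detector that flags changes as semantics-changing. The function `precision_recall` and the toy labels are hypothetical and chosen only for illustration; they are not the paper's data and do not reproduce its numbers.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for boolean 'semantics-changing' labels."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Precision: share of flagged changes that really change semantics.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of real semantics changes that were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels (assumption, not from the paper): True means "semantics-changing".
toy_predicted = [True, True, False, True, False, False]
toy_actual = [True, False, True, True, False, True]

precision, recall = precision_recall(toy_predicted, toy_actual)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Under this definition, a recall of 7.6% for existing regression tests means that the tests flag only a small fraction of the changes annotated as semantics-changing.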
