RunBugRun -- An Executable Dataset for Automated Program Repair

Julian Aron Prenner; Romain Robbes

TL;DR

RunBugRun delivers a large-scale, fully executable APR dataset spanning eight languages, paired with a sandboxed infrastructure for compilation, execution, and test-based evaluation. It bridges traditional generate-and-validate repair and neural program repair by combining executable ground truth with the data scale typical of NPR, enabling semantic correctness assessments at scale. The paper provides extensive data curation, fine-grained bug labeling, and baseline evaluations (Cardumen and CodeT5) to map current capabilities and gaps, while demonstrating promising cross-language transfer and the value of execution-informed signals. Overall, RunBugRun advances scalable, multilingual, execution-enabled APR research, with practical implications for more robust and diverse repair systems.

Abstract

Automated Program Repair (APR) has recently been shifting towards data-driven techniques, in particular deep neural networks. This entails training on hundreds of thousands or even millions of non-executable code fragments. We would like to draw more attention to an aspect of code often neglected in Neural Program Repair (NPR), namely its execution. Code execution has several significant advantages: it allows for test-based evaluation of candidate fixes and can provide valuable information to aid repair. In this work we present a fully executable dataset of 450,000 small buggy/fixed program pairs, written in eight different programming languages and originally submitted to programming competition websites. Along with the dataset we provide infrastructure to compile, safely execute and test programs, as well as fine-grained bug-type labels. To give a point of reference, we provide basic evaluation results for two baselines, one based on a generate-and-validate approach and one on deep learning. With this dataset we pursue several goals: we want to lift Neural Program Repair beyond fully static code representations, foster the use of execution-based features and, by including several different languages, counterbalance the predominance of Java in the current landscape of APR datasets and benchmarks.
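The test-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the dataset's actual infrastructure: the function name `run_tests`, the toy programs and the test cases are assumptions made for illustration, and the sketch runs code in a plain subprocess without the sandboxing that the paper's infrastructure provides.

```python
import subprocess
import sys


def run_tests(source, tests, timeout=5.0):
    """Run a Python program on each (stdin, expected_stdout) pair.

    Hypothetical helper: returns one boolean per test case. A test passes
    only if the program exits cleanly and prints the expected output.
    """
    results = []
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A program that hangs counts as a failed test.
            results.append(False)
            continue
        passed = proc.returncode == 0 and proc.stdout.strip() == expected.strip()
        results.append(passed)
    return results


# Toy buggy/fixed pair in the style of a contest submission (made up here).
BUGGY = "a, b = map(int, input().split())\nprint(a - b)\n"
FIXED = "a, b = map(int, input().split())\nprint(a + b)\n"
TESTS = [("1 2\n", "3\n"), ("5 0\n", "5\n")]
```

Calling `run_tests(BUGGY, TESTS)` and `run_tests(FIXED, TESTS)` yields the per-test pass/fail lists for each program. Note that the buggy program still passes the second test, which is why a candidate fix is usually judged against all available tests rather than a single one.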

Paper Structure (58 sections, 1 equation, 6 figures, 4 tables)

Introduction
The need for execution at scale
The need for multi-lingual APR
The need for better curation and insights
RunBugRun
Related Work
APR Benchmarks
NPR Datasets
Works based on contest submissions
Motivation
APR needs a large-scale and executable dataset
Moving NPR and G&V closer
Benchmark overfitting issues
Test-based evaluation improves over textual evaluations
Test-based evaluation should be made more efficient
...and 43 more sections

Figures (6)

Figure 1: Example of a bug instance from the presented dataset, along with metadata. Each bug comes with the programming language used, the split (i.e., train, valid or test), one or more hierarchical bug labels, a list of failed and passed test cases and one or more error messages in case a runtime error or exception occurred. See Figures \ref{fig:example-bug-ruby} and \ref{fig:example-bug-java} for more examples.
Figure 2: Two alternative patches for a Ruby bug in our dataset. The result of line 5 remains unused. Patch 1 (line 5) uses a method variant that modifies line in-place; patch 2 (line 6) reassigns the result to line, leading to the same result. This shows that determining patch correctness statically (i.e., without execution) often results in false negatives.
Figure 3: A fix from the TSSB-3M dataset (Richter and Wehrheim, 2022), which, with three million instances, is currently one of the largest APR datasets available. The dataset contains non-executable code fragments. Although this dataset instance is marked as a "likely bug", the change (updating the copyright year) is unlikely to resolve a functional bug.
Figure 4: Language distribution for the training, test and validation set, respectively.
Figure 5: Distribution of the number of changes from single change (at the very left) to six changes (at the very right).
...and 1 more figure
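The point made in the caption of Figure 2 can be reproduced with a small Python analogue. The sketch below is not taken from the dataset: the function `smallest`, both patches and the tests are hypothetical, and `sorted`/`list.sort` stand in for the Ruby method variants. It shows two textually different patches that both pass the tests, so an exact-match comparison against a single reference fix would wrongly reject one of them.

```python
# Buggy version: the result of sorted() is discarded, so items stays unsorted.
BUGGY = """
def smallest(items):
    sorted(items)
    return items[0]
"""

# Patch 1: use the variant that sorts the list in place.
PATCH_IN_PLACE = """
def smallest(items):
    items.sort()
    return items[0]
"""

# Patch 2: reassign the sorted result to the same name.
PATCH_REASSIGN = """
def smallest(items):
    items = sorted(items)
    return items[0]
"""

# Hypothetical (input, expected output) test cases.
TESTS = [([3, 1, 2], 1), ([5], 5), ([2, 9, 0], 0)]


def exact_match(candidate, reference):
    """Static check: compare patch text against the reference fix."""
    return candidate.strip() == reference.strip()


def passes_tests(candidate, tests):
    """Dynamic check: execute the candidate and compare its outputs."""
    namespace = {}
    exec(candidate, namespace)  # acceptable here only because the code is trusted
    fn = namespace["smallest"]
    # Pass a copy so an in-place sort cannot affect later test cases.
    return all(fn(list(arg)) == expected for arg, expected in tests)
```

With `PATCH_IN_PLACE` taken as the reference fix, `exact_match(PATCH_REASSIGN, PATCH_IN_PLACE)` is false even though `passes_tests` accepts both patches and rejects the buggy version. This is the false negative that static evaluation produces.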
